
Open Access 2025 | Original Paper | Book Chapter

Variable Elimination as Rewriting in a Linear Lambda Calculus

Authors: Thomas Ehrhard, Claudia Faggian, Michele Pagani

Published in: Programming Languages and Systems

Publisher: Springer Nature Switzerland


Abstract

The chapter opens by emphasizing the expressiveness of probabilistic programming languages (PPLs) for stochastic modeling and Bayesian reasoning, and highlights the computational challenges of exact inference. It introduces variable elimination as a key algorithm for exact inference on discrete graphical models, discussing its efficiency and the importance of the elimination order. The core of the chapter presents a novel approach that understands variable elimination as a rewriting process in a linear lambda calculus, defining an inference algorithm that formalizes the classical algorithm as a program transformation. This approach integrates denotational semantics to validate the factorisation algorithm, guaranteeing soundness and completeness. The chapter also studies the computational behavior of the algorithm on stochastic programs representing Bayesian networks, demonstrating its efficiency and the potential to improve inference cost through program transformations. Finally, it discusses the implications of this foundational understanding for the design of general-purpose algorithms for exact inference in probabilistic programming languages, paving the way for more efficient and scalable stochastic modeling techniques.
Notes

Supplementary Information

The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-91118-7_12.

1 Introduction

Probabilistic programming languages (PPLs) provide a rich and expressive framework for stochastic modeling and Bayesian reasoning. The crucial but computationally hard task is that of inference, i.e. explicitly computing the probability distribution which is implicitly specified by the probabilistic program. Most PPLs focus on continuous random variables—in this setting the inference engine typically implements approximate inference algorithms based on sampling methods (such as importance sampling, Markov Chain Monte Carlo, Gibbs sampling). However, several domains of application (e.g. network verification, ranking and voting, text or graph analysis) are naturally discrete, leading to increasing interest in the challenge of exact inference [17, 19, 21, 34, 35, 39, 44, 46]. A good example is Dice [21], a first-order functional language whose inference algorithm exploits the structure of the program in order to factorise inference, making it possible to scale exact inference to large distributions. Most exact approaches share a common ground: they are inspired by techniques for exact inference on discrete graphical models, which typically exploit probabilistic independence as the key to compact representation and efficient inference.
Indeed, specialized formalisms do come with highly efficient algorithms for exact inference; a prominent example is that of Bayesian networks, which enable algorithms such as Message Passing [37] and Variable Elimination [45]—to name two classical ones—and a variety of approaches for exploiting local structure, such as reducing inference to Weighted Model Counting [3, 4]. General-purpose programming languages do provide rich expressiveness, which allows in particular for the encoding of Bayesian networks; however, the corresponding algorithms are often lost when leaving the realm of graphical models for PPLs, leaving an uncomfortable gap between the two worlds. Our goal is to shed light on this gray area, understanding exact inference as an algorithm acting on programs.
In pioneering work, Koller et al. [27] define a general-purpose functional language which not only is able to encode Bayesian networks (as well as other specialized formalisms), but also comes with an algorithm which mimics Variable Elimination (\(\textsf{VE}\) for short) by means of term transformation. \(\textsf{VE}\) is arguably the simplest algorithm for exact inference: inference is factorised into smaller intermediate computations by eliminating the irrelevant variables according to a specific order. The limit of [27] is that, unfortunately, its algorithm can only implement one specific elimination ordering (the one determined by the lazy evaluation implicit in the algorithm), which might not be the most efficient: a different ordering might result in smaller intermediate factors. The general problem of dealing with any possible ordering, hence producing any possible factorisation, is left there as an open challenge for further investigation. In a series of subsequent papers, the authors take a direction different from term rewriting: eventually, in [39], programs are compiled into an intermediate structure, and it is on this graph structure that a sophisticated variant of \(\textsf{VE}\) is performed. The question of understanding \(\textsf{VE}\) as a transformation on programs remains open; we believe it is important for a foundational understanding of PPLs.
In this paper, we provide an answer, defining an inference algorithm which fully formalizes the classical \(\textsf{VE}\) algorithm as rewriting of programs, expressed in an idealized probabilistic functional language—a linear simply-typed \(\lambda \)-calculus suffices for our purpose. Formally, we prove soundness and completeness of our algorithm with respect to the standard one. Notice that the choice of the elimination order is not part of a \(\textsf{VE}\) algorithm—several heuristics are available in the literature to compute an efficient elimination order (see e.g. [8]). As desired, we prove that any given elimination ordering can be implemented by our algorithm. When we run it on a stochastic program representing a Bayesian network, its computational behaviour is the same as that of standard \(\textsf{VE}\) for Bayesian networks, and the cost is of the same complexity order. While the idea behind \(\textsf{VE}\) is simple, crafting an algorithm on terms which is able to implement any elimination order is non-trivial; our success here relies on the use of linear types \(P\multimap T\) enabled by linear logic [18], accounting for the interdependences generated by a specific elimination order. Let us explain the main ideas.
Factorising Inference, via the graph structure. Bayesian networks describe a set of random variables and their conditional (in)dependencies. Let us restrict ourselves to boolean random variables, i.e. variables x representing a boolean value \(\texttt{t}\) or \(\texttt{f}\) with some probability. Such a variable can be described as a vector of two non-negative real numbers \((\rho _\texttt{t}, \rho _\texttt{f})\) quantifying the probability \(\rho _\texttt{t}\) (resp. \(\rho _\texttt{f}\)) of sampling \(\texttt{t}\) (resp. \(\texttt{f}\)) from x.
Fig. 1.
Example of running the \(\textsf{VE}\) algorithm on a let-term \(\ell \).
Fig. 1a depicts an example of a Bayesian network. It is a directed acyclic graph \(\mathcal {G}\) where the nodes are associated with random variables and the arrows describe conditional dependencies between these variables. For instance, in Fig. 1a the variable \(x_5\) depends on the values sampled from \(x_3\) and \(x_4\), and, in turn, it affects the probability of which boolean we can sample from \(x_6\). The network does not give direct access to a vector \((\rho _\texttt{t}, \rho _\texttt{f})\) describing \(x_5\) on its own, but only to a stochastic matrix \(\mathtt M_5\) quantifying the conditional dependence of \(x_5\) with respect to \(x_3\) and \(x_4\). Formally, \(\mathtt M_5\) is a matrix with four rows, representing the four possible outcomes of a joint sample of \(x_3\) and \(x_4\) (i.e. \((\texttt{t}, \texttt{t})\), \((\texttt{t}, \texttt{f})\), \((\texttt{f}, \texttt{t})\), \((\texttt{f}, \texttt{f})\)), and two columns, representing the two possible outcomes for \(x_5\). The matrix is stochastic in the sense that each row represents a probability distribution over booleans: for instance, \((\mathtt M_5)_{(\texttt{t}, \texttt{f}), \texttt{t}}=0.4\) and \((\mathtt M_5)_{(\texttt{t}, \texttt{f}), \texttt{f}}=0.6\) mean that, whenever \(x_3\) has been observed to be \(\texttt{t}\) and \(x_4\) to be \(\texttt{f}\), \(\texttt{t}\) can be sampled from \(x_5\) with a 40% chance and \(\texttt{f}\) with a 60% chance.
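To make the shape of such a matrix concrete, here is a minimal sketch in Python: the two entries quoted above are from the text, while the remaining rows are made-up placeholders (any row-stochastic values would do).

```python
# M5 as a conditional probability table: rows are joint outcomes of (x3, x4),
# columns are outcomes of x5.
M5 = {
    (True, True):   {True: 0.9, False: 0.1},
    (True, False):  {True: 0.4, False: 0.6},  # the entries quoted above
    (False, True):  {True: 0.7, False: 0.3},
    (False, False): {True: 0.2, False: 0.8},
}
# Stochasticity: every row is a probability distribution over booleans.
assert all(abs(sum(row.values()) - 1.0) < 1e-12 for row in M5.values())
```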
Having such a graph \(\mathcal {G}\) and the stochastic matrices \(\mathtt M_1\), \(\mathtt M_2\), \(\mathtt M_3\), etc. associated with its nodes, a typical query is to compute the joint probability of a subset of the variables of \(\mathcal {G}\): for example, the vector \(\Pr (x_3,x_6)=(\rho _{(\texttt{t}, \texttt{t})},\rho _{(\texttt{t}, \texttt{f})},\rho _{(\texttt{f}, \texttt{t})},\rho _{(\texttt{f}, \texttt{f})})\) giving the marginal over \(x_3\) and \(x_6\), i.e. the probability of each possible outcome of \(x_3\) and \(x_6\). A way to obtain this is to first compute the joint distribution of all variables in the graph in a single shot and then sum out the variables we are not interested in. This gives, for every possible boolean value \(b_3,b_6\) in \(\{\texttt{t},\texttt{f}\}\) taken by, respectively, \(x_3\) and \(x_6\), the following expression:
$$\begin{aligned} \sum _{ \begin{array}{c} b_1,b_2,b_4, b_5\in \{\texttt{t}, \texttt{f}\} \end{array} } (\mathtt M_1)_{b_1} (\mathtt M_2)_{b_1, b_2} (\mathtt M_3)_{b_2, b_3} (\mathtt M_4)_{b_4} (\mathtt M_5)_{(b_3, b_4), b_5} (\mathtt M_6)_{(b_2, b_5), b_6}\,. \end{aligned}$$
(1)
For each of the \(2^2\) possible values of the indexes \((b_3, b_6)\) we have a sum of \(2^4\) terms. That is, to compute the joint probability of \((x_3, x_6)\), we have to compute \(2^6\) entries. This method is unfeasible in general, as it requires a number of operations exponential in the size of \(\mathcal {G}\). Luckily, one can take advantage of the conditional (in)dependencies encoded by \(\mathcal {G}\) to get a better factorisation than (1), breaking the computation into factors of smaller size. For example:
$$\begin{aligned} \sum _{b_5}\! \left( \!\sum _{b_2}\! \left( \!\sum _{b_1} (\mathtt M_1)_{b_1} (\mathtt M_2)_{b_1, b_2} \!\right) (\mathtt M_3)_{b_2, b_3} (\mathtt M_6)_{(b_2, b_5), b_6} \!\right) \! \left( \! \sum _{b_4} (\mathtt M_4)_{b_4} (\mathtt M_5)_{(b_3, b_4), b_5} \!\right) \!. \end{aligned}$$
(2)
Let us denote by \(\phi ^i\) the intermediate factor in (2) identified by the sum over \(b_i\): for example \(\phi ^2\) is the sum \(\sum _{b_2}\phi ^1_{b_2}(\mathtt M_3)_{b_2, b_3}(\mathtt M_6)_{(b_2, b_5), b_6}\). Notice that, assuming the results of computing \(\phi ^1\) have been memorised, obtaining \(\phi ^2\) requires computing \(2^4\) entries, i.e. the cost is exponential in the number of distinct indexes \(b_j\) appearing in the expression defining \(\phi ^2\). By applying the same reasoning to all factors in (2), one notices that in the whole computation we never need to compute more than \(2^4\) entries: we have gained a factor of \(2^2\) with respect to (1).
The Variable Elimination algorithm performs factorisations like (2) in order to compute the desired marginal distribution more efficiently. The factorisation is characterised by an ordered sequence of unobserved (or marginalised) variables to eliminate, i.e. to sum out. The factorisation in (2) is induced by the sequence \((x_1,x_2,x_4,x_5)\): \(\phi ^1\) eliminates variable \(x_1\), \(\phi ^2\) then eliminates variable \(x_2\), and so on. Different orders yield different factorisations with different performances, e.g. the inverse order \((x_5,x_4,x_2,x_1)\) is less efficient, as the largest factor here requires computing \(2^5\) entries.
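The gain is easy to reproduce concretely. The following sketch (in Python, assuming NumPy; the matrix entries are randomly generated rather than taken from Fig. 1) computes the marginal over \((x_3,x_6)\) both naively, as in (1), and along the elimination order \((x_1,x_2,x_4,x_5)\), as in (2), and checks that the two agree.

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic(rows):
    m = rng.random((rows, 2))
    return m / m.sum(axis=1, keepdims=True)  # each row sums to 1

M1 = stochastic(1)[0]                # prior on x1, shape (2,)
M2 = stochastic(2)                   # x2 given x1, shape (2, 2)
M3 = stochastic(2)                   # x3 given x2, shape (2, 2)
M4 = stochastic(1)[0]                # prior on x4, shape (2,)
M5 = stochastic(4).reshape(2, 2, 2)  # x5 given (x3, x4)
M6 = stochastic(4).reshape(2, 2, 2)  # x6 given (x2, x5)

# (1): sum over all b1, b2, b4, b5 for each (b3, b6) -- 2^6 entries overall.
naive = np.zeros((2, 2))
for b1 in range(2):
    for b2 in range(2):
        for b3 in range(2):
            for b4 in range(2):
                for b5 in range(2):
                    for b6 in range(2):
                        naive[b3, b6] += (M1[b1] * M2[b1, b2] * M3[b2, b3] *
                                          M4[b4] * M5[b3, b4, b5] *
                                          M6[b2, b5, b6])

# (2): eliminate x1, x2, x4, x5 in this order; no factor exceeds 2^4 entries.
phi1 = np.einsum('a,ab->b', M1, M2)              # phi^1, indexed by b2
phi2 = np.einsum('b,bc,bef->cef', phi1, M3, M6)  # phi^2, indexed by b3,b5,b6
phi4 = np.einsum('d,cde->ce', M4, M5)            # phi^4, indexed by b3,b5
marginal = np.einsum('cef,ce->cf', phi2, phi4)   # phi^5 = Pr(x3, x6)

assert np.allclose(naive, marginal)
```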
Factorising Inference, via the program structure. In the literature, factorisations are usually described as collections of factors (basically vectors) and the \(\textsf{VE}\) algorithm is presented as an iterative algorithm acting on such collections. In this paper we propose a framework that gives more structure to this picture, expressing \(\textsf{VE}\) as a program transformation, building a factorisation by induction on the structure of a program, compositionally. More precisely:
  • we define a fragment \(\mathcal {L}\) of the linear simply-typed \(\lambda \)-calculus, which is able to represent any factorisation of a Bayesian network query as a \(\lambda \)-term. Random variables are associated with term variables of ground type;
  • we express the \(\textsf{VE}\) algorithm as rewriting over the \(\lambda \)-terms in \(\mathcal {L}\), which consists in reducing the scope of the variables to be eliminated.
Our approach integrates, and is grounded in, the denotational semantics of the terms, which directly reflects and validates the factorisation algorithm—yielding soundness and completeness. We stress that inference amounts to computing the semantics of the program (the marginal distribution it defines).
The reader can easily convince herself that the query about the joint marginal distribution \((x_3, x_6)\) of the Bayesian network in Fig. 1a can be expressed by the let-term \(\ell \) in Fig. 1b, where we have enriched the syntax of \(\lambda \)-terms with constants representing the stochastic matrices. We consider \(\texttt{let}\, x\, =\, e\, \texttt{in}\, e'\) as syntactic sugar for \((\lambda x. e')e\), so the term \(\ell \) can be seen as a \(\lambda \)-term, which is moreover typable in a linear type system like the one in Fig. 2. The fact that some variables \(x_i\) have several free occurrences in sub-terms of \(\ell \) does not conflict with the linearity of the term, as let-expressions are evaluated following a call-by-value strategy, and ground values (e.g. booleans) can be duplicated in linear systems (see Ex. 1 and Remark 2 for more details).
Terms of this kind are associated, in quantitative denotational semantics such as [7, 29], with algebraic expressions which give the joint distribution of the output variables. Here, in the same spirit as [11], we adopt a variant of the quantitative denotation (Sect. 3), which can be seen as a compact reformulation of the original model, and is more suitable for dealing with factorised inference. It turns out that when we compositionally compute the semantics of \(\ell \) following the structure of the program, we obtain a more efficient computation than (1). This is because the inductive interpretation yields intermediate factors of smaller size, similarly to what algorithms for exact inference do. Indeed, the inductive interpretation of \(\ell \) behaves like \(\textsf{VE}\) with the elimination order \((x_5,x_4,x_2,x_1)\), see Ex. 5.
Notice that different programs may encode the same model and query, but with a significantly different inference cost, due to their different structure. A natural question is then to wonder whether we can directly act on the structure of the program, in such a way that the semantics is invariant, but inference is more efficient. In fact, we show that the language \(\mathcal {L}\) is sufficiently expressive to represent all possible factorisations of (1), e.g. (2). The main idea for such a representation arises from the observation that summing out variables in the semantics corresponds, in the syntax, to making a let-in definition local to a sub-expression. For example, the factor \(\phi ^1 = \sum _{b_1}(\mathtt M_1)_{b_1}(\mathtt M_2)_{b_1, b_2}\) of (2) can easily be obtained by making the variable \(x_1\) local to the definition of \(x_2\), creating a \(\lambda \)-term \(e^1\) of the shape \(\texttt{let}\, x_1\, =\, \mathtt M_1\, \texttt{in}\, \mathtt M_2 x_1\) and replacing the first two definitions of \(\ell \) with \(\texttt{let}\, x_2\, =\, e^1\, \texttt{in}\, \dots \). In fact, the denotation of \(e^1\) is exactly \(\phi ^1\). What about \(\phi ^2\)? Here the situation is subtler: in order to make the definition of \(x_2\) local, one should gather together the definitions of \(x_3\) and \(x_6\); but the definition of \(x_6\) depends on a variable \(x_5\) which in turn depends on \(x_3\), so a simple factor of ground type (i.e. tensors of booleans) would generate a dependency cycle. Luckily, we can use a (linear) functional type, defining a \(\lambda \)-term \(e^2\) as \(\texttt{let}\, x_2\, =\, e^1\, \texttt{in}\, (\mathtt M_3 x_2, \lambda x_5.\mathtt M_6 (x_2, x_5))\) and then transforming \(\ell \) into Fig. 1c. Again, we can notice that the denotation of \(e^2\) is exactly \(\phi ^2\). Fig. 7 details the whole rewriting mimicking the elimination of the variables \((x_1, x_2, x_4, x_5)\) applied to \(\ell \). This paper shows how to generalise this reasoning to any let-term.
Contents of the paper. We present an algorithm which is able to fully perform \(\textsf{VE}\) on programs: for any elimination order, a program \( \ell \) is rewritten by the algorithm into a program \( \ell '\), representing—possibly in a more efficient way—the same model. As stressed, our investigation is of a foundational nature; we focus on a theoretical framework in which we are able to prove the soundness and completeness (Th. 1, Cor. 1) of the \(\textsf{VE}\) algorithm on terms. To do so, we leverage the quantitative denotation of the terms. The structure of the paper is as follows.
– Sect. 2 defines the linear \(\lambda \)-calculus \(\mathcal {L}\), and its semantics. In particular, Fig. 2 gives the linear typing system, which is a fragment of multiplicative linear logic [18] (Remark 2). Fig. 4 sketches the denotational semantics of \(\mathcal {L}\) as weighted relations [29]. At first, the reader who wishes to focus on \(\textsf{VE}\) can skip the formal details about the semantics, and just read the intuitions in Sect. 2.2.
– Sect. 3 formalises the notion of factorisation as a set of factors (Def. 1) and shows how to associate a factorisation to the let-terms in \(\mathcal {L}\) (Def. 4). Def. 5 recalls the standard \(\textsf{VE}\) algorithm—acting on sets of factors—denoted by \(\textsf{VE}^{\mathsf F}\).
– Sect. 4 is the core of our paper, capturing \(\textsf{VE}\) as a let-term transformation (Def. 8), denoted here by \(\textsf{VE}^\mathcal {L}\), using the rewriting rules of Fig. 6. We prove the correspondence between the two versions of \(\textsf{VE}\) in Th. 1 and Cor. 1, stating our main result, the soundness and completeness of \(\textsf{VE}^\mathcal {L}\) with respect to \(\textsf{VE}^{\mathsf F}\).
As an extra bonus, an enrichment of the semantics—based on probabilistic coherence spaces [7]—yields a neat property of the terms of \(\mathcal {L}\), namely that the total mass of the denotation is easily computable from the term's type (Prop. 1).
Missing proofs and more technical details are in the extended version [12].
Related work. Variants of factorisation algorithms were invented independently in multiple communities (see [28] for a survey). The algorithm of Variable Elimination (\(\textsf{VE}\)) was first formalised in [45]. The approach to \(\textsf{VE}\) which is usually taken by PPLs is to compile a program into an intermediate structure, on which \(\textsf{VE}\) is performed. Our specific contribution is to provide the first algorithm which fully performs \(\textsf{VE}\) directly on programs. As explained, by this we mean the following. First, we observe that the inductive interpretation of a term behaves as \(\textsf{VE}\), for an ordering of the variables to eliminate which is implicit in the structure of the program—possibly a non-efficient one. Second, our algorithm transforms the program in such a way that its structure reflects \(\textsf{VE}\) according to any arbitrary ordering, while the semantics (the denoted model) is invariant.
As we discussed before, our work builds on the programme put forward in [27]. Pfeffer [39] (page 417) summarizes the limits of the algorithm in [27] this way: "The solution is only partial. Given a BN encoded in their language, the algorithm can be viewed as performing Variable Elimination using a particular elimination order: namely, from the last variable in the program upward. It is well-known that the cost of VE is highly dependent on the elimination order, so the algorithm is exponentially more expensive for some families of models than an algorithm that can use any order." The algorithm we present here achieves a full solution: any elimination order can be implemented.
The literature on probabilistic programming languages and inference algorithms is vast, even restricting attention to exact inference. At the beginning of the Introduction we have mentioned several relevant contributions. Here we briefly discuss two lines of work which are especially relevant to our approach.
Rewriting the program to improve inference efficiency, as in [19]. A key goal of PPLs is to separate the model description (the program) from the inference task. As pointed out by [19], such a goal is hard to achieve in practice. To improve inference efficiency, users are often forced to rewrite the program by hand.
Exploiting the local structure of the program to achieve efficient inference, as in [21]. This road is taken by the authors of the language Dice—here the algorithm does not act on the program itself.
Our work incorporates elements from both lines: we perform inference compositionally, following the inductive structure of the program; to improve efficiency, our rewriting algorithm modifies the program structure (modelling the \(\textsf{VE}\) algorithm), while keeping the semantics invariant.
As a matter of fact, our first-order language is very similar to the language Dice [21], and has similar expressiveness. In order to keep presentation and proofs simple, we prefer to omit a conditioning construct such as observe, but it could easily be accommodated (see Sect. 5). There are however significant differences. We focus on \(\textsf{VE}\), while Dice implements a different inference algorithm, compiling programs to weighted Boolean formulas, then performing Weighted Model Counting [3]. Moreover, as said, Dice exploits the local structure of the given program, without program transformations to improve the inference cost, which is instead at the core of our approach.
Rewriting is central to [19]. The focus there is on probabilistic programs with mixed discrete and continuous parameters: by eliminating the discrete parameters, general (gradient-based) algorithms can then be used. To automate this process, the authors introduce an information flow type system that can detect conditional independencies; rewriting uses \(\textsf{VE}\) techniques, even though the authors are not directly interested in the equivalence with the standard \(\textsf{VE}\) algorithm (this is left as a conjecture). In the same line, we mention also a very recent work [31] which tackles a task similar to [19]; while using similar ideas, a new design in both the language and the information flow type system allows the authors to deal with bounded recursion. The term transformations have a different goal than ours, compiling a probabilistic program into a pure one. However, some key elements there resonate with our approach: program transformations are based on continuation passing style (we use arrow variables in a similar fashion); the language in [31] is not defined by an operational semantics; instead the authors, like us, adopt a compositional, denotational treatment.
Denotational semantics versus cost-awareness. Our approach integrates, and is grounded in, a quantitative denotational semantics. Pioneering work by [24, 25] has paved the way for a logical and semantical comprehension of Bayesian networks and inference from a categorical perspective, yielding an extensive body of work based on the setting of string diagrams, e.g. [5, 22, 23]. A denotational take on Bayesian networks is also at the core of [36], and underlies the categorical framework of [43]. These lines of research, however, do not take into consideration the computational cost, which is the very reason motivating the introduction and development of Bayesian networks, and of inference algorithms such as \(\textsf{VE}\). In the literature, foundational understanding tends to focus on either a compositional semantics or on efficiency, but the two worlds are separated, and typically explored as independent entities. This dichotomy stands in stark contrast to Bayesian networks, where the representation, the semantics (i.e. the underlying joint distribution), and the inference algorithms are deeply intertwined. A new perspective has recently been propounded by [11], advocating the need for a quantitative semantical approach more attentive to resource consumption and to the actual cost of computing the semantics, which here exactly corresponds to performing inference. Our contribution fits in this line, which also inspires the cost-aware semantics in [16]. The latter introduces a higher-order language (in the idealized form of a \(\lambda \)-calculus) which is sound and complete w.r.t. Bayesian networks, together with a type system which computes the cost of (inductively performed) inference. Notice that [16] does not deal with term transformations to rewrite a program into a more efficient one. Such transformations, reflecting the essence of the \(\textsf{VE}\) algorithm, are exactly the core of our paper; our algorithm easily adapts to the first-order fragment of [16].

2 \(\mathcal {L}\) Calculus and Let-Terms

We consider a linear simply-typed \(\lambda \)-calculus extended with stochastic matrices over tuples of booleans. Our results can be extended to more general systems, but here we focus on the core fragment able to represent Bayesian networks and the factorisations produced by the \(\textsf{VE}\) algorithm. In particular, we adopt a specific class of types, where arrow types are restricted to linear maps from (basically) tuples of booleans (i.e. the values of positive types in the grammar below) to pairs of a tuple of booleans and, possibly, another arrow.

2.1 Syntax

We consider the following grammar of types:
$$\begin{aligned} P, Q, \dots &\mathrel {::=}\textsf{Bool}\mid P\otimes Q&\text {(positive types)}\\ A, B, \dots &\mathrel {::=}P\multimap T&\text {(arrow types)}\\ T, S, \dots &\mathrel {::=}P\mid A\mid P\otimes T&\text {(let-term types)} \end{aligned}$$
It is convenient to adopt a typing system à la Church, i.e. we will consider type-annotated variables, meaning that we fix a set of variables and a function \(\textsf{ty}\) from this set to the set of positive and arrow types (see e.g. [20, ch. 10]; this style is opposed to the typing system à la Curry, where types are associated with variables by typing contexts). We call a variable v positive (resp. arrow) whenever \(\textsf{ty}(v)\) is a positive (resp. arrow) type. We use metavariables x, y, z (resp. f, g, h) to range over positive (resp. arrow) variables. The letters v, w will be used to denote indistinctly positive or arrow variables.
The syntax of \(\mathcal {L}\) is given by the following 3-sorted grammar, where \(\mathtt M\) is a metavariable corresponding to a stochastic matrix between tuples of booleans:
$$\begin{aligned} \boldsymbol{v}&\mathrel {::=}v \mid (\boldsymbol{v}, \boldsymbol{v}')\quad \qquad \qquad \text {if } \textsf{FV}(\boldsymbol{v}) \cap \textsf{FV}(\boldsymbol{v}') =\emptyset & &\text {(patterns)} \\ e&\mathrel {::=}v \mid \mathtt M(\boldsymbol{x}) \mid f \boldsymbol{x} \mid (e,e') \mid \lambda \boldsymbol{x}.e \mid \texttt{let}\, \boldsymbol{v}\, =\, e\, \texttt{in}\, e' & &\text {(expressions)} \\ \ell &\mathrel {::=}\boldsymbol{v} \mid \texttt{let}\, \boldsymbol{v}\, =\, e\, \texttt{in}\, \ell & &\text {(let-terms)} \end{aligned}$$
Notice that a pattern is required to have pairwise distinct variables. We allow 0-ary stochastic matrices, representing random generators of boolean tuples, which we denote simply by \(\mathtt M\), instead of \(\mathtt M()\). We assume two 0-ary stochastic matrices \(\texttt{t}\) and \(\texttt{f}\) representing the two boolean values.
We denote by \(\textsf{FV}(e) \) the set of free variables of an expression e. As standard, \(\lambda \)-abstractions \(\lambda \boldsymbol{v}.e\) and let-ins \(\texttt{let}\, \boldsymbol{v}\, =\, e'\, \texttt{in}\, e\) bind in the subexpression e all occurrences of the variables in the pattern \(\boldsymbol{v}\). In fact, \(\texttt{let}\, \boldsymbol{v}\, =\, e'\, \texttt{in}\, e\) can be thought of as syntactic sugar for \((\lambda \boldsymbol{v}.e)e'\). Given a set of variables \(\mathcal {V}\), we denote by \(\mathcal {V}^a\) (resp. \(\mathcal {V}^+\)) the subset of the arrow variables (resp. positive variables) in \(\mathcal {V}\); in particular, \(\textsf{FV}(e)^{a}\) denotes the set of arrow variables free in e.
Patterns are a special kind of let-term, and let-terms are in turn a special kind of expression. A pattern is called positive if all its variables are positive. We use metavariables \(\boldsymbol{x}, \boldsymbol{y}, \boldsymbol{z}\) to range over positive patterns. A let-term is positive if its rightmost pattern is positive, i.e.: \(\texttt{let}\, \boldsymbol{v}\, =\, e\, \texttt{in}\, \ell \) is positive if \(\ell \) is positive.
Fig. 2.
Typing rules: the binary rules suppose that the sets of free arrow variables of the subterms are disjoint; the let-rule binding an arrow variable f also requires that this variable f is free in the expression \(e'\).
Fig. 2 gives the rules generating the set of well-typed expressions (and so including patterns and let-terms). As standard in typing systems à la Church, we omit an explicit typing environment in the typing judgment \(e:T\), as this can be recovered from the typing of the free variables of e, i.e. if \(\textsf{FV}(e) =\{x_1,\dots ,x_n\}\), then à la Curry we would write \(e:T\) as \(x_1:\textsf{ty}(x_1),\dots ,x_n:\textsf{ty}(x_n)\vdash e:T\). See Fig. 3 for an example of a typing derivation in both styles.
The binary rules suppose the side condition \(\textsf{FV}(e)^{a}\cap \textsf{FV}(e')^{a}=\emptyset \) and the let-in rule binding an arrow variable f has also the condition \(f\in \textsf{FV}(e')^{a}\). These conditions guarantee that arrow variables are used linearly, ensuring the linear feature of the typing system.
Example 1
Fig. 3.
An example of type derivation, in both Church and Curry style.
Consider the term \( \texttt{let}\, v'\, =\, v\, \texttt{in}\, (v,v')\), which will duplicate any value assigned to the free variable v. If v has positive type (e.g. boolean), it admits the type derivation in Fig. 3. By contrast, if v has arrow type, no type derivation is possible, in agreement with the fact that arrows can only occur linearly. See also the discussion in Ex. 4.
Notice that \(\lambda \)-abstractions are restricted to positive patterns. Also, a typing derivation of conclusion \(e:T\) is completely determined by its expression. This means that if an expression can be typed, then its type is unique. Because of that, we can extend the function \(\textsf{ty}\) to all expressions, i.e. \(\textsf{ty}(e)\) is the unique type such that \(e:\textsf{ty}(e)\) is derivable, whenever e is well-typed.
Notice that well-typed patterns \(\boldsymbol{v}\) have at most one occurrence of an arrow variable (which is moreover in the rightmost position of the pattern). By extension of notation, we write \(\boldsymbol{v}^a\) for the only arrow variable in \(\boldsymbol{v}\), if it exists; otherwise we consider it undefined. We write \(\boldsymbol{v}^+\) for the pattern obtained from \(\boldsymbol{v}\) by removing the arrow variable \(\boldsymbol{v}^a\), if any. In particular, \(\boldsymbol{x}^+=\boldsymbol{x}\).
Remark 1
The readers acquainted with linear logic [18] may observe that the above grammar of types identifies a special fragment of this logic. In fact, the boolean type \(\textsf{Bool}\) may be expressed as the additive disjunction of the tensor unit: \(\textbf{1}\oplus \textbf{1}\). Since \(\otimes \) distributes over \(\oplus \), positive types are isomorphic to n-ary booleans, for some \(n\in \mathbb {N}\), i.e. \(\bigoplus _{n} \textbf{1}\), which is a notation for \(\textbf{1}\oplus \dots \oplus \textbf{1}\) n-times.
Moreover, by the standard linear logic isomorphisms \((\bigoplus _{n} \textbf{1})\multimap T\simeq \mathbin {\&}_{n} T\) and \((\bigoplus _{n} \textbf{1})\otimes T\simeq \bigoplus _{n} T\), where & denotes the additive conjunction, we deduce that the grammar of \(\mathcal {L}\) types is equivalent to an alternation of balanced additive connectives, i.e. it can be presented by the grammar \(T\mathrel {::=}\bigoplus _{n} \textbf{1}\mid \bigoplus _{n} \mathbin {\&}_{m} T\), for \(n, m\in \mathbb {N}\). The typing system hence identifies a fragment of linear logic with a non-trivial regularity (the alternation), expressing more than just the set of arrows between tuples of booleans.
Remark 2
Notice that the binary rules might require contracting some positive types in the environment, as the expressions in the premises might have positive variables in common. In fact, it is well-known that contraction and weakening rules are derivable for the positive formulas in the environment, e.g. \(\textbf{1}\oplus \textbf{1}\vdash (\textbf{1}\oplus \textbf{1})\otimes (\textbf{1}\oplus \textbf{1})\) is provable in linear logic. From a categorical point of view, this corresponds to the fact that positive types define co-algebras and, operationally, that positive values can be duplicated or erased without needing to be promoted (see e.g. [15] in the setting of PPLs).
In the following we will represent a let-term \(\ell :=\texttt{let}\, \boldsymbol{v}_1\, =\, e_1\, \texttt{in}\, \dots \texttt{let}\, \boldsymbol{v}_n\, =\, e_n\, \texttt{in}\, \boldsymbol{v}_{n+1}\) with the more concise notation \( \ell := (\boldsymbol{v}_1=e_1;\dots ;\boldsymbol{v}_n=e_n\, \texttt{in}\, \boldsymbol{v}_{n+1}) \). By renaming we can always suppose, if needed, that the patterns \(\boldsymbol{v}_1,\dots , \boldsymbol{v}_{n}\) are pairwise disjoint sequences of variables and that none of these variables has occurrences (free or not) outside the scope of its binder. We call \(\{\boldsymbol{v}_1=e_1\), ..., \(\boldsymbol{v}_n=e_n\}\) the set of the definitions of \(\ell \) (which has exactly n elements thanks to the convention of having \(\boldsymbol{v}_1, \dots , \boldsymbol{v}_n\) pairwise disjoint), and \(\biguplus _{i=1}^n\boldsymbol{v}_i\) the set of the defined variables of \(\ell \). The final pattern \(\boldsymbol{v}_{n+1}\) is called the output of \(\ell \). Notice that \(\ell \) is positive if its output is positive.
Example 2
A Bayesian network of n nodes can be represented by a closed let-term \(\ell \) having all variables positive and n definitions of the form \(x=\mathtt M(\boldsymbol{y})\). The variables are associated with the edges of the graph and the definitions with the nodes such that \(x=\mathtt M(y_1,\dots , y_k)\) represents a node with stochastic matrix \(\mathtt M\), an outgoing edge associated with x and k incoming edges associated with, respectively, \(y_1, \dots , y_k\). The output pattern contains the variables associated with a specific query to the Bayesian network.
For instance, the Bayesian network in Fig. 1a is represented by the let-term: \( (x_1\!=\!\mathtt M_1; x_2\!=\!\mathtt M_2x_1; x_3\!=\!\mathtt M_3x_2; x_4\!=\!\mathtt M_4; x_5\!=\!\mathtt M_5(x_3, x_4); x_6\!=\!\mathtt M_6(x_2, x_5) \, \texttt{ in } \, (x_3, x_6) ) \), which is the succinct notation for the let-term \(\ell \) in Fig. 1b. Notice that \(\ell \) induces a linear order on the nodes of the graph. The same graph can be represented by other let-terms, differing just in the order of their definitions. For example, by swapping the definitions of \(x_3\) and \(x_4\) we get a different let-term representing the same Bayesian network. We will consider this “swapping” invariance in full generality by defining the swapping rewriting \(\gamma \) in Fig. 6 and stating Lemma 1.
As discussed in the Introduction, let-terms with arrow variables and \(\lambda \)-abstractions might be needed to represent the result of applying the \(\textsf{VE}\) algorithm to a Bayesian network. For instance, Fig. 7 details the let-term produced by the elimination of the variables \((x_1, x_2, x_4, x_5)\). For example, the closed subexpression \(e_2\) in Fig. 7 keeps the variables \(x_1\) and \(x_2\) local and has type \(\textsf{Bool}\otimes (\textsf{Bool}\multimap \textsf{Bool})\).
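As a down-to-earth illustration, the succinct notation of Ex. 2 can be transcribed as plain data (a Python sketch; the representation is ours, for illustration only): each definition binds a variable to a matrix name applied to previously defined variables.

```python
# The let-term of Ex. 2 as a list of definitions plus an output pattern.
ell = {
    "definitions": [
        ("x1", "M1", []),            # x1 = M1
        ("x2", "M2", ["x1"]),        # x2 = M2 x1
        ("x3", "M3", ["x2"]),        # x3 = M3 x2
        ("x4", "M4", []),            # x4 = M4
        ("x5", "M5", ["x3", "x4"]),  # x5 = M5(x3, x4)
        ("x6", "M6", ["x2", "x5"]),  # x6 = M6(x2, x5)
    ],
    "output": ("x3", "x6"),
}
# The "swapping" invariance of Lemma 1 corresponds to permuting adjacent
# definitions that do not depend on each other, e.g. those of x3 and x4.
```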

2.2 Semantics

We do not detail an operational semantics of \(\mathcal {L}\), which can be defined in a standard way by using a sample-based or distribution-based semantics, in the spirit of e.g. [2]. We prefer to focus on the denotational semantics, which is more suitable for expressing the variable elimination algorithm in a compositional way. Below, Examples 3 and 4 informally illustrate the denotational and operational behavior of a \(\texttt{let}\)-term, highlighting its linear nature.
Semantics, a gentle presentation. We consider the semantics of weighted relations [29], which is an example of quantitative semantics of linear logic interpreting programs as matrices over non-negative real numbers. The intuition behind this semantics is quite simple: each type \(T\) is associated with a finite set \(\vert T \vert \) of indexes, called the web of \(T\) (see (3)). In the case of a positive type, the web is the set of all possible outcomes of a computation of that type: the web of the boolean type \(\vert \textsf{Bool} \vert \) is the set of the two booleans \(\{\texttt{t},\texttt{f}\}\), and the web of a tensor \(P\otimes Q\) is the cartesian product \(\vert P \vert \times \vert Q \vert \) of the webs of its components. An arrow type \(P\multimap T\) is also associated with the cartesian product \(\vert P \vert \times \vert T \vert \), intuitively representing the elements of the trace of a function of type \(P\multimap T\). To sum up, in this very simple fragment, webs are sets of nested tuples of booleans, e.g. \(\vert (\textsf{Bool}\otimes \textsf{Bool})\multimap \textsf{Bool} \vert =\{((b_1,b_2),b_3)\;\vert \;b_i\in \vert \textsf{Bool} \vert \}\).
The denotation \(\llbracket e\rrbracket \) of an expression e is then a matrix (sometimes called a weighted relation) whose rows are indexed by the sequences of elements in the webs of the free variables of e and whose columns are indexed by the elements in the web of the type of e. E.g., consider the expression \(e_0\) given by \(\texttt{let}\, y\, =\, fx\, \texttt{in}\, (z,y)\), with free variables \(f:\textsf{Bool}\multimap \textsf{Bool}\), \(x:\textsf{Bool}\) and \(z:\textsf{Bool}\), and type \(\textsf{Bool}\otimes \textsf{Bool}\). The matrix \(\llbracket e_0\rrbracket \) will have rows indexed by tuples \(((b_1,b_2), b_3, b_4)\) and columns by \((b_5,b_6)\), for \(b_i\)’s in \(\{\texttt{t},\texttt{f}\}\). Intuitively, the entry \(\llbracket e_0\rrbracket _{((b_1,b_2), b_3, b_4),(b_5,b_6)}\) gives a weight to the possibility of a computation where the free variables of \(e_0\) “behave” as \((b_1,b_2)\) for f, \(b_3\) for x and \(b_4\) for z, and the output is \((b_5,b_6)\).
The matrix \(\llbracket e\rrbracket \) is defined by structural induction on the expression e by using matrix composition (for let-construction and application) and tensor product (for tuples), plus the diagonalisation of the indexes in the variables common to sub-expressions. Fig. 4 details this definition, giving a precise meaning to each programming construct. For example, taking the notation of Fig. 4, the definition of \(\llbracket (e',e'')\rrbracket _{\overline{a}, (b',b'')}\) states that the weight of getting \((b',b'')\) supposing \(\overline{a}\) is the product of the weights of getting \(b'\) from \(e'\) and \(b''\) from \(e''\), supposing \(\overline{a}\) in both cases. The sharing of \(\overline{a}\) in the two components of the tuple characterises the linearity of this calculus. Let us discuss this point with another example.
Example 3
(Linearity, denotationally). Let us write \(\texttt{coin}_{0.3} \) for a random generator of boolean values (a 0-ary stochastic matrix), modeling a biased coin. In our setting \(\llbracket \texttt{coin}_{0.3}\rrbracket \) is a row vector (0.3, 0.7) modeling the probability of sampling \(\texttt{t}\) or \(\texttt{f}\). Let e be the closed term \(\texttt{let}\, v\, =\, \texttt{coin}_{0.3}\, \texttt{in}\, \texttt{let}\, v'\, =\, v\, \texttt{in}\, (v,v')\), of type \(\textsf{Bool}\otimes \textsf{Bool}\), well-typed because v is positive (\(\textsf{ty}(v) = \textsf{Bool}\)). Since e is closed, \(\llbracket e\rrbracket \) is also a row vector, now of dimension 4. One can easily check that \(\llbracket e\rrbracket =(0.3, 0, 0, 0.7)\), stating that the only possible outcomes are the couples \((\texttt{t},\texttt{t})\) and \((\texttt{f},\texttt{f})\), while \((\texttt{t},\texttt{f})\), \((\texttt{f},\texttt{t})\) have probability zero to happen.
Notice that \(\llbracket e\rrbracket \) is different from \(\llbracket (\texttt{coin}_{0.3}, \texttt{coin}_{0.3})\rrbracket = (0.3^2, 0.21, 0.21, 0.7^2)\). In fact, \(\llbracket e\rrbracket \) is linear in \(\texttt{coin}_{0.3}\), while \(\llbracket (\texttt{coin}_{0.3}, \texttt{coin}_{0.3})\rrbracket \) is quadratic.
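The two denotations can be checked with a few lines of plain Python (a sketch; the dictionaries stand for the row vectors above):

```python
coin = {True: 0.3, False: 0.7}  # [[coin_0.3]] as a weighted vector

# [[let v = coin_0.3 in let v' = v in (v, v')]]:
# the shared value is diagonalised, so off-diagonal entries vanish.
e = {(b1, b2): (coin[b1] if b1 == b2 else 0.0)
     for b1 in coin for b2 in coin}

# [[(coin_0.3, coin_0.3)]]: the two components are weighted independently.
pair = {(b1, b2): coin[b1] * coin[b2] for b1 in coin for b2 in coin}

assert e[(True, True)] == 0.3 and e[(True, False)] == 0.0
assert abs(pair[(True, True)] - 0.09) < 1e-12   # quadratic in coin_0.3
```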
Example 4
(Linearity and \(\texttt{let}\)-reduction). Let us give an operational intuition for the term e in Ex. 3. There are two possibilities: we can first sample a boolean from \(\texttt{coin}_{0.3}\) and then replace v with the result of this sampling, or first replace v with the sampler \(\texttt{coin}_{0.3}\) and then sample a boolean from each copy of \(\texttt{coin}_{0.3}\). The semantics states that we follow the former possibility and not the latter (as usual in a setting with effects). Intuitively, \(\texttt{coin}_{0.3}\) reduces to a probabilistic sum \(0.3\cdot \texttt{t}+0.7\cdot \texttt{f}\), and so e first reduces to the sum \(0.3\cdot (\texttt{let}\, v'\, =\, \texttt{t}\, \texttt{in}\, (\texttt{t},v'))+0.7\cdot (\texttt{let}\, v'\, =\, \texttt{f}\, \texttt{in}\, (\texttt{f},v'))\), eventually yielding \(0.3\cdot (\texttt{t},\texttt{t})+0.7\cdot (\texttt{f},\texttt{f})\). In contrast, duplicating the sampler would yield \((\texttt{coin}_{0.3},\texttt{coin}_{0.3})\), whose semantics is different, as discussed in Ex. 3. Finally, notice that replacing in e the argument \(\texttt{coin}_{0.3}\) with an expression \(\lambda x.u\) (of arrow type) yields a term which is not typable in our system (see Ex. 1).
Remark 3
In the standard call-by-value \(\lambda \)-calculus, the abstraction plays two roles: defining functions and acting as a thunk (thus allowing duplication). Please notice that linear abstraction does not act as a thunk. This is common in refinements such as [30] and [41], and in calculi based on Linear Logic [1, 9, 10, 15, 33, 42].
The rest of the subsection recalls from [29] the definitions and notations of the denotational semantics, but the reader can jump to the next section if already satisfied with these intuitions and willing to focus on variable elimination.
Semantics, formally. Let us fix some basic notation from linear algebra. Metavariables \(S, {T}, {U}\) range over finite sets. We denote by \(\textsf{s}(S)\) the cardinality of a set \(S\). We denote by \(\mathbb {R}_{\ge 0}\) the cone of non-negative real numbers. Metavariables \(\phi ,\psi ,\xi \) will range over vectors in \(\mathbb {R}_{\ge 0}^{S}\), for S a finite set, \(\phi _a\) denoting the scalar associated with \(a\in S\) by \(\phi \in \mathbb {R}_{\ge 0}^{S}\). Matrices will be vectors indexed by pairs, e.g. in \(\mathbb {R}_{\ge 0}^{S\times {T}}\) for \(S\) and \({T}\) two finite sets. We may write \(\phi _{a,b}\) instead of \(\phi _{(a,b)}\) for \((a,b)\in S\times {T}\) if we wish to underline that we are considering indexes that are pairs. Given \(\phi \in \mathbb {R}_{\ge 0}^{S\times {T}}\) and \(\psi \in \mathbb {R}_{\ge 0}^{{T}\times {U}}\), the standard matrix multiplication is given by \({\phi }{\psi }\in \mathbb {R}_{\ge 0}^{S\times {U}}\): \(({\phi }{\psi })_{a,c}\mathrel {:=}\sum _{b\in {T}} \phi _{a,b}\psi _{b,c} \in \mathbb {R}_{\ge 0}\). The identity matrix is denoted \(\delta \in \mathbb {R}_{\ge 0}^{S\times S}\) and defined by \(\delta _{a,a'}=1\) if \(a=a'\), otherwise \(\delta _{a,a'}=0\).
A less standard convention, but a common one in this kind of denotational semantics, is to consider the rows of a matrix \(\phi \) as the domain and the columns as the codomain of the underlying linear map. Hence, a vector in \(\mathbb {R}_{\ge 0}^{S}\) is considered as a one-row matrix in \(\mathbb {R}_{\ge 0}^{1\times S}\), and the application of a matrix \(\phi \in \mathbb {R}_{\ge 0}^{S\times {T}}\) to a vector \(\psi \in \mathbb {R}_{\ge 0}^{S}\) is given by \(\phi \cdot \psi \mathrel {:=}{\psi }{\phi }\in \mathbb {R}_{\ge 0}^{1\times {T}}\cong \mathbb {R}_{\ge 0}^{{T}}\).
The model denotes a type \(T\) with a set \(\vert T \vert \), called the web of \(T\), as follows:
$$\begin{aligned} \vert \textsf{Bool} \vert &\mathrel {:=}\{\texttt{t},\texttt{f}\}\,, & \vert P\otimes T \vert &\mathrel {:=}\vert P\multimap T \vert \mathrel {:=}\vert P \vert \times \vert T \vert \,. \end{aligned}$$
(3)
To denote an expression e, we must associate a web with the set of free variables occurring in e. Given a finite set of variables \(\mathcal {V}\), we define \(\vert \mathcal {V} \vert \) by using indexed products: \( \vert \mathcal {V} \vert \mathrel {:=}\prod _{v\in \mathcal {V}}\vert \textsf{ty}(v) \vert \). Metavariables \(\overline{a}, \overline{b}, \overline{c}\) denote elements of such webs \(\vert \mathcal {V} \vert \). In fact, \(\overline{a}\in \vert \mathcal {V} \vert \) can be seen as a function mapping any variable \(v\in \mathcal {V}\) to an element \(\overline{a}_v\in \vert \textsf{ty}(v) \vert \). We denote by \(\star \) the empty function, which is the only element of \(\vert \emptyset \vert =\prod _{\emptyset }\). Given a subset \(\mathcal {V}'\subseteq \mathcal {V}\), we denote by \(\overline{a}\vert _{\mathcal {V}'} \) the restriction of \(\overline{a}\) to \(\mathcal {V}'\), i.e. \(\overline{a}\vert _{\mathcal {V}'} \in \vert \mathcal {V}' \vert \). Also, given two disjoint sets of variables \(\mathcal {V}\) and \(\mathcal {W}\), we denote by \(\overline{a}\uplus \overline{b}\) the union of an element \(\overline{a}\in \vert \mathcal {V} \vert \) and an element \(\overline{b}\in \vert \mathcal {W} \vert \), i.e. \(\overline{a}\uplus \overline{b}\in \vert \mathcal {V}\uplus \mathcal {W} \vert \) and: \((\overline{a}\uplus \overline{b})_v\mathrel {:=}\overline{a}_v\) if \(v\in \mathcal {V}\), and \((\overline{a}\uplus \overline{b})_v\mathrel {:=}\overline{b}_v\) if \(v\in \mathcal {W}\).
Fig. 4.
Denotation of e as a matrix \(\llbracket e\rrbracket \) giving a linear map from \(\vert \textsf{FV}(e) \vert \) to \(\vert \textsf{ty}(e) \vert \), so \(\overline{a}\in \vert \textsf{FV}(e) \vert \) and \(b\in \vert \textsf{ty}(e) \vert \). In the tuple and \(\lambda \) cases, we suppose \(b=(b',b'')\).
An expression e of type \(T\) will be interpreted as a linear map \(\llbracket e\rrbracket \) from \(\mathbb {R}_{\ge 0}^{\vert \textsf{FV}(e) \vert }\) to \(\mathbb {R}_{\ge 0}^{\vert T \vert }\). As such, \(\llbracket e\rrbracket \) can then be presented as a matrix in \(\mathbb {R}_{\ge 0}^{\vert \textsf{FV}(e) \vert \times \vert T \vert }\). Fig. 4 recalls the definition of \(\llbracket e\rrbracket \) by structural induction on e. In the case of \(\mathtt M(\boldsymbol{x})\), we take the liberty to consider an element \(\overline{a}\in \vert \boldsymbol{x} \vert \) as actually the tuple of its components, ordered according to the order of the variables in the pattern \(\boldsymbol{x}\). Similarly, when we compare \(\overline{a}'''\) with \(a'\) in \(\llbracket f \boldsymbol{x}\rrbracket \).
Example 5
Recall the term \(\ell \) in Ex. 2. It is closed and of type \(\textsf{Bool}\otimes \textsf{Bool}\), hence \(\llbracket \ell \rrbracket \) is a one-row matrix in \(\mathbb {R}_{\ge 0}^{\vert \emptyset \vert \times \vert \textsf{Bool}\otimes \textsf{Bool} \vert }\simeq \mathbb {R}_{\ge 0}^4\). By unfolding the definition in Fig. 4, we get the following expression for \(\llbracket \ell \rrbracket _{\star , (b_3, b_6)}\) with \(b_3,b_6\in \{\texttt{t}, \texttt{f}\}\), where all \(b_i\) vary over \(\{\texttt{t}, \texttt{f}\}\), the index i referring to the corresponding variable in \(\ell \):
$$\begin{aligned} & \sum _{b_1}(\mathtt M_1)_{b_1} \Biggl (\sum _{b_2}(\mathtt M_2)_{b_1, b_2} \biggl (\sum _{b'_3}(\mathtt M_3)_{b_2, b'_3} \Bigl (\sum _{b_4}(\mathtt M_4)_{b_4} \bigl (\sum _{b_5}(\mathtt M_5)_{(b_3,b_4), b_5} \nonumber \\ & \qquad \qquad \qquad \qquad \qquad \qquad \qquad \quad \bigl (\sum _{b'_6}(\mathtt M_6)_{(b_2, b_5), b'_6} \delta _{b'_3, b_3} \delta _{b'_6, b_6} \bigr ) \bigr ) \Bigr ) \biggr ) \Biggr ) \,. \end{aligned}$$
(4)
Expression (4) describes a way of computing \(\llbracket \ell \rrbracket \) in a number of basic operations of order \(2^3\) for each of the \(2^2\) possible values of \(b_3,b_6\).
For a more involved example, let us consider the let-term \(\ell '\) in line (L8) of Fig. 7, which is the result of the elimination of the variables \((x_1,x_2)\). We first calculate the semantics \(\llbracket e_2\rrbracket \) of the sub-expression keeping \((x_1,x_2)\) local. Notice that \(e_2\) is a closed expression of type \(\textsf{Bool}\otimes (\textsf{Bool}\multimap \textsf{Bool})\); so, considering \(b_3\in \vert \textsf{Bool} \vert \) and \((b_f,b'_f)\in \vert \textsf{Bool}\multimap \textsf{Bool} \vert \), we have (after some simplification of the \(\delta \)’s):
$$\begin{aligned} \llbracket e_2\rrbracket _{\star , (b_3, (b_f, b'_f))}= \sum _{b_2} \Bigl ( \sum _{b_1} (\mathtt M_1)_{b_1} (\mathtt M_2)_{b_1,b_2} \Bigr ) (\mathtt M_3)_{b_2,b_3} (\mathtt M_6)_{(b_2,b_f),b'_f}\,. \end{aligned}$$
(5)
We can then associate \(\llbracket \ell '\rrbracket _{\star , (b_3, b_6)}\) with the following algebraic expression:
$$\begin{aligned} \sum _{b'_3, (b_f, b'_f)}\!\!\!\! \llbracket e_2\rrbracket _{\star , (b'_3, (b_f, b'_f))} \Biggl (\sum _{b_4} (\mathtt M_4)_{b_4} \Bigl ( \sum _{b_5} (\mathtt M_5)_{(b'_3,b_4),b_5} \bigl ( \sum _{b'_6} \delta _{b_5, b_f}\, \delta _{b'_f,b'_6}\, \delta _{b'_3,b_3}\, \delta _{b'_6, b_6} \bigr ) \Bigr ) \Biggr ) \end{aligned}$$
(6)
Expression (6) reduces to a number of basic operations of order \(2^2\). By memoizing the computation of \(\llbracket e_2\rrbracket \), expression (6) thus offers a way of computing the matrix \(\llbracket \ell '\rrbracket \) in time linear in \(2^2\times 2^2\). Indeed, Proposition 6 guarantees that \(\ell \) and \(\ell '\) (in fact all let-terms in Fig. 7) have the same denotational semantics: so the computation of \(\llbracket \ell '\rrbracket \) gains a factor of 2 with respect to (4).
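Continuing the NumPy sketch given after (2) in the Introduction (and reusing M1, …, M6 and the naive result from there), expressions (5) and (6) become two einsum contractions:

```python
# [[e_2]] from (5): result indexed by (b3, (b_f, b'_f)).
e2 = np.einsum('a,ab,bc,bef->cef', M1, M2, M3, M6)
# [[l']] from (6), with the deltas already simplified away.
lp = np.einsum('cef,d,cde->cf', e2, M4, M5)
assert np.allclose(lp, naive)  # the semantics is invariant under the rewriting
```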
Let us conclude this subsection by observing that the type of a closed expression allows for computing the total mass of the denotational semantics of that expression. With any positive type \(P\) we associate its dimension \(\textsf{dim}(P)\in \mathbb {N}\), defined by \(\textsf{dim}(\textsf{Bool})=2\) and \(\textsf{dim}(P\otimes Q)=\textsf{dim}(P)\textsf{dim}(Q)\). This means that \(\textsf{dim}(P)\) is the cardinality of \(\vert P \vert \). And with any type \(T\) we associate its height \(\textsf{ht}(T)\in \mathbb {N}\), defined by: \(\textsf{ht}(P)=1\), \(\textsf{ht}(P\multimap T)=\textsf{dim}(P)\times \textsf{ht}(T)\) and \(\textsf{ht}(P\otimes T)=\textsf{ht}(T)\).
Proposition 1
For any closed expression e, one has \(\sum _{a\in \vert \textsf{ty}(e) \vert }\llbracket e\rrbracket _{\star , a}=\textsf{ht}(\textsf{ty}(e))\).
Example 6
Take the type \(\textsf{Bool}\otimes \textsf{Bool}\) of the let-terms \(\ell \) and \(\ell '\) discussed in Example 5. We have \(\textsf{ht}(\textsf{Bool}\otimes \textsf{Bool})=1\), in accordance with the fact that all closed expressions of that type (such as \(\ell \) and \(\ell '\)) describe joint probability distributions, so they are denoted by vectors of total mass 1. By contrast, consider the type \(\textsf{Bool}\otimes (\textsf{Bool}\multimap \textsf{Bool})\) of the expression \(e_2\) keeping the variables \(x_1\) and \(x_2\) local. We have \(\textsf{ht}(\textsf{Bool}\otimes (\textsf{Bool}\multimap \textsf{Bool}))=\textsf{ht}(\textsf{Bool}\multimap \textsf{Bool})=2\), which is the expected total mass of a stochastic matrix over booleans. Notice, however, that the type \(\textsf{Bool}\otimes (\textsf{Bool}\multimap \textsf{Bool})\) is subtler than that of a stochastic matrix \(\textsf{Bool}\multimap \textsf{Bool}\): in fact, by using the isomorphisms discussed in Remark 1, we have \(\textsf{Bool}\otimes (\textsf{Bool}\multimap \textsf{Bool})\simeq (\textsf{Bool}\multimap \textsf{Bool})\oplus (\textsf{Bool}\multimap \textsf{Bool})\), which is the type of a probabilistic distribution of stochastic matrices.
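The functions \(\textsf{dim}\) and \(\textsf{ht}\) are immediate to transcribe. Here is a sketch over a toy encoding of the type grammar (the encoding is an assumption of this sketch, not from the paper):

```python
Bool = 'Bool'
def Tensor(p, t): return ('tensor', p, t)
def Arrow(p, t): return ('arrow', p, t)

def dim(P):
    """Cardinality of the web of a positive type P."""
    if P == Bool:
        return 2
    _, p, q = P
    return dim(p) * dim(q)

def ht(T):
    """ht(P) = 1, ht(P -o T) = dim(P) * ht(T), ht(P (x) T) = ht(T)."""
    if T == Bool:
        return 1
    tag, a, b = T
    return ht(b) if tag == 'tensor' else dim(a) * ht(b)

assert ht(Tensor(Bool, Bool)) == 1               # a joint distribution: mass 1
assert ht(Tensor(Bool, Arrow(Bool, Bool))) == 2  # the type of e_2: mass 2
```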

3 Variable Elimination \(\textsf{VE}^{\mathsf F}\) over Let-Term Factors

As mentioned in the Introduction, variable elimination is an iterative procedure transforming sets of factors (one can think of these as originally provided by a Bayesian network). We recall this procedure, adapting it to our setting—in particular, we start from a set \(\textsf{Fs}(\ell )\) of factors generated by a let-term \(\ell \) representing a Bayesian network. Subsect. 3.1 defines factors and the main operations on them (product and summing-out). Subsect. 3.2 shows how to associate a let-term \(\ell \) with a set of factors \(\textsf{Fs}(\ell )\) such that from their product one can recover \(\llbracket \ell \rrbracket \) (Prop. 3). Finally, Subsect. 3.3 presents the variable elimination algorithm as a transformation \(\textsf{VE}^{\mathsf F}\) over \(\textsf{Fs}(\ell )\) (Def. 5), and Prop. 4 gives the soundness of the algorithm. The latter result is standard in the literature (see e.g. [8]); the contribution of this section is the definition of \(\textsf{Fs}(\ell )\), which is essential to link this variable elimination \(\textsf{VE}^{\mathsf F}\) on factors to our main contribution, given in the next section: the variable elimination \(\textsf{VE}^\mathcal {L}\) as a term-rewriting process.

3.1 Factors

Definition 1
(Factor). A factor \(\phi \) is a pair \((\textsf{Var} (\phi ), \textsf{Fun}(\phi ))\) of a finite set \(\textsf{Var} (\phi )\) of typed variables and a function \(\textsf{Fun}(\phi )\) from the web \(\vert \textsf{Var} (\phi ) \vert \) to \(\mathbb {R}_{\ge 0}\).
We will shorten the notation \(\textsf{Fun}(\phi )\) by writing just \(\phi \) when it is clear from the context that we are considering the function associated with a factor and not the whole pair \((\textsf{Var} (\phi ), \textsf{Fun}(\phi ))\). We often consider \(\textsf{Fun}(\phi )\) as a vector indexed by the elements of its domain, so that \({\phi }_{\overline{a}}\) stands for \(\textsf{Fun}(\phi )(\overline{a})\), for every \(\overline{a}\in \vert \textsf{Var} (\phi ) \vert \).
The degree of \(\phi \), written \(\textsf{d}_\phi \), is the cardinality of \(\textsf{Var} (\phi )\), and the base of \(\phi \), written \(\textsf{b}_\phi \), is the maximal cardinality of \(\vert v \vert \) for every \(v\in \textsf{Var} (\phi )\). Notice that \(\textsf{b}_\phi ^{\textsf{d}_\phi }\) is an upper bound to the dimension of \(\textsf{Fun}(\phi )\), i.e. the cardinality of \(\vert \textsf{Var} (\phi ) \vert \).
Example 7
Sect. 3.2 formalises how to associate the definitions of a let-expression with factors. Let us anticipate a bit and see as an example the factor \(\phi \) that will be associated with the definition \(x_5=\mathtt M_5(x_3,x_4)\) in the let-term in Ex. 2. We have \(\textsf{Var} (\phi )=\{x_3,x_4,x_5\}\) and for every \(a, b, c\in \vert \textsf{Bool} \vert \) we have \(\textsf{Fun}(\phi )(a,b,c)=(\mathtt M_5)_{(a,b),c}\). Notice that \(\phi \) forgets the input/output (or rows/columns) distinction carried by the indexes of the stochastic matrix \(\mathtt M_5\).
A factor \((\textsf{Var} (\phi ), \textsf{Fun}(\phi ))\) involves two “levels” of indexing: one is given by the variables \(v_1, v_2, \dots \in \textsf{Var} (\phi )\) tagging the different sets of the product \(\vert \textsf{Var} (\phi ) \vert \mathrel {:=}\prod _{v\in \textsf{Var} (\phi )}\vert v \vert \), and the other “level” is given by \(\overline{a}, \overline{b},\dots \in \vert \textsf{Var} (\phi ) \vert \) labelling the different components of the vector \(\textsf{Fun}(\phi )\), which we call web elements.
Recall that the set of variables \(\textsf{Var} (\phi )\) endows \(\vert \textsf{Var} (\phi ) \vert \) with a cartesian structure, so that we can project a web element \(\overline{a}\in \vert \textsf{Var} (\phi ) \vert \) on some subset of variables \(\mathcal {V}'\subseteq \textsf{Var} (\phi )\) by writing \(\overline{a}\vert _{\mathcal {V}'} \), and we can pair two web elements into \(\overline{a}\uplus \overline{a}'\) whenever \(\overline{a}\in \vert \textsf{Var} (\phi ) \vert \), \(\overline{a}'\in \vert \textsf{Var} (\phi ') \vert \) and \(\textsf{Var} (\phi )\cap \textsf{Var} (\phi ')=\emptyset \).
Fig. 5.
Summing-out \(\sum _{\mathcal {V}}(\phi )\) of a set of variables \(\mathcal {V}\) in a factor \(\phi \) and product \(\phi \odot \psi \) of two factors \(\phi \), \(\psi \). We suppose \(\overline{a}\in \vert \textsf{Var} (\phi )\setminus \mathcal {V} \vert \) and \(\overline{c}\in \vert \textsf{Var} (\phi \odot \psi ) \vert \).
Fig. 5 defines the two main operations on factors: summing-out and binary products. We illustrate them with some examples and remarks.
Example 8
By recalling the factor \(\phi \) of Ex. 7, we have that \(\textsf{Var} (\sum _{\{x_3\}}(\phi ))=\{x_4,x_5\}\) and for every \(a,b\in \vert \textsf{Bool} \vert \), \(\sum _{\{x_3\}}(\phi )_{(a,b)}=(\mathtt M_5)_{(\texttt{t},a),b}+(\mathtt M_5)_{(\texttt{f},a),b}\). In fact, we can also perform less standard summings-out, as for example \(\textsf{Var} (\sum _{\{x_3, x_5\}}(\phi ))=\{x_4\}\), so that \(\sum _{\{x_3, x_5\}}(\phi )_{a}=(\mathtt M_5)_{(\texttt{t},a),\texttt{t}}+(\mathtt M_5)_{(\texttt{t},a),\texttt{f}}+(\mathtt M_5)_{(\texttt{f},a),\texttt{t}}+(\mathtt M_5)_{(\texttt{f},a),\texttt{f}}\) may be a scalar greater than one, no longer representing a probability.
With the notations of Fig. 5, if \(\phi \) is a joint distribution over \(\vert \textsf{Var} (\phi ) \vert \), the summing-out of \(\mathcal {V}\) in \(\phi \) gives the marginal distribution over \(\vert \textsf{Var} (\phi )\setminus \mathcal {V} \vert \). In the degenerate case where \(\textsf{Var} (\phi )\subseteq \mathcal {V}\), \(\textsf{Var} (\sum _{\mathcal {V}}(\phi ))\) is the empty set and \(\sum _{\mathcal {V}}(\phi )_{\star }\) is the total mass of \(\phi \), i.e. \(\sum _{\overline{b}\in \vert \textsf{Var} (\phi ) \vert }\phi _{\overline{b}}\).
Example 9
Recall the factor \(\phi =(\{x_3,x_4,x_5\}, (a,b,c\mapsto (\mathtt M_5)_{(a,b),c}))\) of Ex. 7, representing the definition \(x_5=\mathtt M_5(x_3,x_4)\) in the let-term in Ex. 2, and consider a factor \(\psi =(\{x_3,x_4\}, (a,b\mapsto \mathtt M'_{a,b}))\) representing some definition \(x_4=\mathtt M'(x_3)\). Then, \(\textsf{Var} ({\phi \odot \psi })=\{x_3,x_4,x_5\}\) and for every \(a,b,c\in \vert \textsf{Bool} \vert \), we have \(\textsf{Fun}(\phi \odot \psi )(a,b,c)=(\mathtt M_5)_{(a,b),c}\mathtt M'_{a,b}\). Notice that the factor product \(\phi \odot \psi \) is not the tensor product \(\otimes \) of the vectors \(\textsf{Fun}(\phi )\) and \(\textsf{Fun}(\psi )\), as variables can be shared between the different factors. In fact, the dimension of \(\textsf{Fun}(\phi )\otimes \textsf{Fun}(\psi )\) is \(2^3\times 2^2=2^{5}\), while that of \(\textsf{Fun}(\phi \odot \psi )\) is \(2^3\).
Notice that the computation of the summing-out \(\sum _{\mathcal {V}}(\phi )\) is in \(O\bigl (\textsf{b}_\phi ^{\textsf{d}_\phi }\bigr )\), as \(\textsf{b}_\phi ^{\textsf{d}_\phi }\) is an upper bound on the cardinality of \(\vert \textsf{Var} (\phi ) \vert \), which gives the number of basic operations needed to define \(\sum _{\mathcal {V}}(\phi )\). Analogously, the computation of \(\phi \odot \psi \) is in \(O\bigl (\textsf{b}_{\phi \odot \psi }^{\textsf{d}_{\phi \odot \psi }}\bigr )=O\bigl (\max (\textsf{b}_{\phi },\textsf{b}_\psi )^{\textsf{d}_{\phi }+\textsf{d}_{\psi }}\bigr )\), as \(\textsf{b}_{\phi \odot \psi }^{\textsf{d}_{\phi \odot \psi }}\) is an upper bound on the cardinality of \(\vert \textsf{Var} (\phi \odot \psi ) \vert \), which gives the number of basic operations needed to define \(\phi \odot \psi \).
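For concreteness, the two operations of Fig. 5 transcribe directly onto the representation sketched in Subsect. 3.1; the following is an illustrative sketch, not an optimised implementation.
```python
def project(a, variables):
    """Restriction a|_V' of a web element to a subset of variables."""
    return Assignment((v, x) for (v, x) in a if v in variables)

def sum_out(phi, vs):
    """Summing-out: the variables in vs disappear and entries that agree
    outside vs are added together (Fig. 5, left)."""
    doms = {v: w for v, w in phi.domains.items() if v not in vs}
    table = {}
    for a, p in phi.table.items():
        key = project(a, doms)
        table[key] = table.get(key, 0.0) + p
    return Factor(doms, table)

def fproduct(phi, psi):
    """Factor product phi ⊙ psi: pointwise product over the union of the
    variables; shared variables are matched, not tensored (Fig. 5, right)."""
    doms = {**phi.domains, **psi.domains}
    return Factor(doms, {c: phi.table[project(c, phi.domains)]
                            * psi.table[project(c, psi.domains)]
                         for c in web(doms)})
```
Both functions range over at most \(\textsf{b}^{\textsf{d}}\) table entries, matching the complexity remark above.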
Proposition 2
Factor product is associative and commutative, with neutral element the empty factor \((\emptyset , 1)\). Moreover:
1.
\(\sum _{\mathcal {V}}(\sum _{\mathcal {W}}(\phi ))=\sum _{\mathcal {V}\cup \mathcal {W}}(\phi )\);
 
2.
\(\sum _{\mathcal {V}}(\phi \odot \psi )=(\sum _{\mathcal {V}}(\phi ))\odot \psi \), whenever \(\textsf{Var} (\psi )\cap \mathcal {V}=\emptyset \).
 
Definition 2
(I-factor product). Let I be a finite set. Given a collection of factors \((\phi _i)_{i\in I}\), we define their factor product as the factor \(\bigodot _{i\in I}\phi _i\mathrel {:=}\phi _{i_1}\odot \dots \odot \phi _{i_n}\), for some enumeration \(i_1,\dots ,i_n\) of I. This is well-defined, independently of the chosen enumeration, because of Prop. 2.
By iterating our remark on the complexity of computing binary products, we have that the computation of the whole vector \(\textsf{Fun}(\bigodot _{i\in I}\phi _i)\) is in \(O\bigl (\textsf{s}(I)\textsf{b}_{\bigodot _{i\in I}\phi _i}^{\textsf{d}_{\bigodot _{i\in I}\phi _i}}\bigr )\), where we recall that \(\textsf{s}(I)\) denotes the cardinality of I.
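The I-factor product is then a fold over the binary product, with the empty factor \((\emptyset ,1)\) as unit; by Prop. 2 the traversal order is irrelevant (continuing the sketch above):
```python
from functools import reduce

EMPTY = Factor({}, {Assignment(): 1.0})  # the neutral element (∅, 1)

def big_product(factors):
    """⨀_{i∈I} φ_i: well-defined in any enumeration order by Prop. 2."""
    return reduce(fproduct, list(factors), EMPTY)
```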

3.2 Let-terms as Sets of Factors

Let us introduce some convenient notation. Metavariables \(\varGamma , \varDelta , \varXi \) will range over finite sets of factors. We lift the notation for factors to sets of factors: we write \(\textsf{Var} (\varGamma )\) for the union \(\bigcup _{\phi \in \varGamma }\textsf{Var} (\phi )\), so we can speak about a variable of \(\varGamma \), meaning a variable of one (or more) factor in \(\varGamma \); hence, the degree \(\textsf{d}_{\varGamma }\) (resp. the base \(\textsf{b}_{\varGamma }\)) of \(\varGamma \) is the cardinality of \(\textsf{Var} (\varGamma )\) (resp. the maximal cardinality of a set \(\vert v \vert \) for \(v\in \textsf{Var} (\varGamma )\)). Also, the operations of summing-out and of product with a factor are lifted component-wise, i.e. \(\sum _{\mathcal {V}}(\varGamma )\mathrel {:=}\{\sum _{\mathcal {V}}(\phi )\mathrel {|}\phi \in \varGamma \}\) and \(\psi \odot \varGamma \mathrel {:=}\{\psi \odot \phi \mathrel {|}\phi \in \varGamma \}\). In contrast, the I-factor product \(\bigodot \varGamma \) returns the single factor resulting from the product of all factors in \(\varGamma \), according to Def. 2.
Given a set of variables \(\mathcal {V}\), it will be convenient to partition \(\varGamma \) into \(\varGamma _{\mathcal {V}}\) and \(\varGamma _{\lnot \mathcal {V}}\), depending on whether a factor in \(\varGamma \) has common labels with \(\mathcal {V}\) or not, i.e.:
$$\begin{aligned} \varGamma _{\mathcal {V}}&\mathrel {:=}\{\phi \in \varGamma \mathrel {|}\textsf{Var} (\phi )\cap \mathcal {V}\ne \emptyset \}\,,& \varGamma _{\lnot \mathcal {V}}&\mathrel {:=}\{\phi \in \varGamma \mathrel {|}\textsf{Var} (\phi )\cap \mathcal {V}=\emptyset \}\,. \end{aligned}$$
(7)
Notice that \(\varGamma =\varGamma _{\mathcal {V}}\uplus \varGamma _{\lnot \mathcal {V}}\), as well as \(\textsf{Var} (\varGamma )\cap \mathcal {V}\subseteq \textsf{Var} (\varGamma _{\mathcal {V}})\) and \(\textsf{Var} (\varGamma _{\lnot \mathcal {V}})\subseteq \textsf{Var} (\varGamma )\setminus \mathcal {V}\). In the case of singletons \(\{v\}\), we can simply write \(\varGamma _{v}\) and \(\varGamma _{\lnot v}\).
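On the sketch above, the partition of eq. (7) reads as follows (gamma being any finite collection of factors):
```python
def partition(gamma, vs):
    """Split Γ into Γ_V (factors sharing a variable with V) and Γ_¬V, eq. (7)."""
    vs = set(vs)
    gamma_v    = [phi for phi in gamma if set(phi.domains) & vs]
    gamma_notv = [phi for phi in gamma if not (set(phi.domains) & vs)]
    return gamma_v, gamma_notv
```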
Definition 3
(\(\textsf{F}(\boldsymbol{v} = e)\)). Given a pattern \(\boldsymbol{v}\) and expression e s.t. \(\textsf{FV}(\boldsymbol{v}) \cap \textsf{FV}(e) =\emptyset \), we define \(\textsf{F}(\boldsymbol{v} = e)\), by: \(\textsf{Var} (\textsf{F}(\boldsymbol{v} = e))\mathrel {:=}\textsf{FV}(e) \uplus \textsf{FV}(\boldsymbol{v}) \) and \(\textsf{Fun}(\textsf{F}(\boldsymbol{v} = e))\mathrel {:=}\overline{a}\uplus \overline{b}\mapsto \llbracket e\rrbracket _{\overline{a}, \overline{b}}\), for \(\overline{a}\in \vert \textsf{FV}(e) \vert \), \(\overline{b}\in \vert \textsf{FV}(\boldsymbol{v}) \vert \).
In a definition \(\boldsymbol{v} = e\), e's free variables can be seen as input channels, and \(\boldsymbol{v}\)'s variables as output channels. This is also reflected in the matrix \(\llbracket e\rrbracket \), where rows are associated with inputs and columns with outputs. In contrast, a factor forgets such a distinction, mixing all indexes in a common family.
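As an illustration of this forgetting of the input/output distinction, here is how the factor of a definition \(x=\mathtt M(\boldsymbol{x})\) could be built from a conditional probability table, in the spirit of Ex. 7 and Def. 3. The helper, the boolean domains and the cpt encoding are our own assumptions for the sketch:
```python
BOOL = (True, False)

def factor_from_cpt(inputs, output, cpt):
    """Factor of a definition `output = M(inputs)`: cpt maps
    (tuple of input values, output value), i.e. the row/column indexes
    of the stochastic matrix M, to a probability; the resulting factor
    mixes all variables in a single set Var(phi)."""
    doms = {v: BOOL for v in list(inputs) + [output]}
    table = {}
    for a in web(doms):
        d = dict(a)
        row = tuple(d[v] for v in inputs)  # input (row) index of M
        table[a] = cpt[(row, d[output])]   # output (column) index of M
    return Factor(doms, table)
```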
Let us warn that Def. 3, as well as the next Def. 4, are not compatible with the renaming of bound variables (a.k.a. \(\alpha \)-equivalence), as they use bound variables as names for the variables of factors. Of course, one can define an equivalence of factors by renaming their variables, but this must be done consistently on all factors taken into consideration.
Definition 4
(\(\textsf{Fs}(\ell )\)). Given a let-term \(\ell \) with output pattern \(\boldsymbol{w}\), we define the set of factors \(\textsf{Fs}(\ell )\), by induction on the number of definitions of \(\ell \):
$$\begin{aligned} \textsf{Fs}(\boldsymbol{w}) &\mathrel {:=}\{(\textsf{FV}(\boldsymbol{w}) , \overline{a}\mapsto 1)\} \\ \textsf{Fs}((\boldsymbol{v}=e\, \texttt{in}\, \ell )) &\mathrel {:=}{\left\{ \begin{array}{ll} \left\{ \sum _{f}(\textsf{F}(\boldsymbol{v} = e)\odot \textsf{Fs}(\ell )_{f})\right\} \uplus \textsf{Fs}(\ell )_{\lnot f} & \text {if } f=\boldsymbol{v}^{a}\notin \textsf{FV}(\boldsymbol{w}) ,\\ \{\textsf{F}(\boldsymbol{v} = e)\}\uplus \textsf{Fs}(\ell ) & \text {otherwise}. \end{array}\right. } \end{aligned}$$
The definition of \(\textsf{Fs}((\boldsymbol{v}=e\, \texttt{in}\, \ell ))\) is justified by the linear status of the arrow variables, ensured by the typing system. In a let-term \(((\boldsymbol{x}, f)=e\, \texttt{in}\, \ell )\), we have two disjoint cases: either the arrow variable f occurs free exactly once in one of the definitions of \(\ell \), or f is free in the output \(\boldsymbol{w}\) of \(\ell \). In the former case, \(\textsf{Fs}(\ell )_{f}\) is a singleton \(\{\phi \}\), and we can sum out f once \(\textsf{F}((\boldsymbol{x}, f) = e)\) has been multiplied with \(\phi \), as no other factor will use f. In the latter case, we keep f in the family of the factors associated with the let-term, as this variable will appear in its output.
Example 10
Let us consider Fig. 7. The set of factors of the let-term \(\ell \) in (L1) has exactly 7 elements: the 1-constant factor associated with the output and one factor for each definition, carrying the corresponding stochastic matrix \(\mathtt M_i\). For a less obvious example, consider the term \(\ell '\) in (L8). The set \(\textsf{Fs}(\ell ')\) has 4 factors: one for the output, two associated with the definitions of, respectively, \(x_4\) and \(x_5\), and the last one defined as \(\sum _{f}(\textsf{F}((x_3, f) = e_2)\odot \textsf{F}(x_6 = f x_5))\). Notice that \(\textsf{F}(x_6 = f x_5)_{\overline{a}}=1\) if \(\overline{a}_f = (\overline{a}_{x_5}, \overline{a}_{x_6})\), and \(\textsf{F}(x_6 = f x_5)_{\overline{a}}=0\) otherwise. Therefore, for any fixed \(b_5\in \vert x_5 \vert \) and \(b_6\in \vert x_6 \vert \), the sum-out on f produces a sum with only one non-zero term.
Notice also that all let-terms from line (L12) have a set of factors of cardinality two, although they may have more than one definition.
The following proposition shows how to recover the quantitative semantics \(\llbracket \ell \rrbracket \) of a let-term \(\ell \) out of the set of factors \(\textsf{Fs}(\ell )\): take the product of all factors in \(\textsf{Fs}(\ell )\) and sum out all variables that are neither free in \(\ell \) nor occur in the output. The proposition is proven by induction on \(\ell \).
Proposition 3
Consider a let-term \(\ell \) with output \(\boldsymbol{v}\). Let \(\mathcal F=\textsf{Var} (\textsf{Fs}(\ell ))\), and consider \(\overline{a}\in \vert \textsf{FV}(\ell ) \vert \), \(\overline{b}\in \vert \textsf{FV}(\boldsymbol{v}) \vert \). If \(\overline{a}\vert _{\textsf{FV}(\ell ) \cap \textsf{FV}(\boldsymbol{v}) } =\overline{b}\vert _{\textsf{FV}(\ell ) \cap \textsf{FV}(\boldsymbol{v}) } \), then, writing \(\overline{a}' = \overline{a}\vert _{\textsf{FV}(\ell ) \setminus \textsf{FV}(\boldsymbol{v}) } \), \(\overline{b}' = \overline{b}\vert _{\textsf{FV}(\boldsymbol{v}) \setminus \textsf{FV}(\ell ) } \) and \(\overline{c}=\overline{a}\vert _{\textsf{FV}(\ell ) \cap \textsf{FV}(\boldsymbol{v}) } =\overline{b}\vert _{\textsf{FV}(\ell ) \cap \textsf{FV}(\boldsymbol{v}) } \), we have \(\llbracket \ell \rrbracket _{\overline{a},\overline{b}} =\sum _{\mathcal F\setminus (\textsf{FV}(\ell ) \cup \textsf{FV}(\boldsymbol{v}) )}(\bigodot \textsf{Fs}(\ell ))(\overline{a}' \uplus \overline{c}\uplus \overline{b}')\). Otherwise \(\llbracket \ell \rrbracket _{\overline{a},\overline{b}} =0\). In particular, if \(\ell \) is closed, then \(\llbracket \ell \rrbracket _{\star ,\overline{b}}=\sum _{\mathcal F\setminus \textsf{FV}(\boldsymbol{v}) }(\bigodot \textsf{Fs}(\ell ))(\overline{b})\).

3.3 Variable Elimination \(\textsf{VE}^{\mathsf F}\) over Sets of Factors

We recall the definition of the variable elimination algorithm as acting on sets of factors. Prop. 4 states its soundness, which is a standard result that we revisit here just to fix our notation. We refer to [8, ch.6] for more details.
Definition 5
(Variable elimination over sets of factors). The elimination of a variable v in a set of factors \(\varGamma \) is the set of factors \(\textsf{VE}^{\mathsf F}(\varGamma , v)\) defined by:
$$\begin{aligned} \textsf{VE}^{\mathsf F}(\varGamma , v)&\mathrel {:=}\{ \sum _v\bigodot \varGamma _{v}\}\uplus \varGamma _{\lnot v} \end{aligned}$$
(8)
This definition extends to finite sequences of variables \((v_1,\dots , v_h)\) by iteration:
$$\begin{aligned} \textsf{VE}^{\mathsf F}(\varGamma , (v_1,\dots , v_h))&\mathrel {:=}\textsf{VE}^{\mathsf F}(\textsf{VE}^{\mathsf F}(\varGamma , v_1), (v_2,\dots , v_h)) \end{aligned}$$
(9)
if \(h>0\), otherwise \(\textsf{VE}^{\mathsf F}(\varGamma , ())=\varGamma \).
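Def. 5 then amounts to a short loop over the factor operations sketched in Subsect. 3.1 (again a sketch; in particular, it ignores the choice of a good elimination ordering):
```python
def ve_f(gamma, variables):
    """Variable elimination VE^F(Γ, (v1, ..., vh)) of Def. 5."""
    gamma = list(gamma)
    for v in variables:                                  # iteration, eq. (9)
        gamma_v, gamma_notv = partition(gamma, {v})
        gamma = [sum_out(big_product(gamma_v), {v})] + gamma_notv  # eq. (8)
    return gamma
```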
Example 11
Recall the sets of factors \(\textsf{Fs}(\ell )\) and \(\textsf{Fs}(\ell ')\) of Ex. 10. An easy computation gives: \(\textsf{Fs}(\ell ')=\textsf{VE}^{\mathsf F}(\textsf{Fs}(\ell ), (x_1,x_2))\).
The soundness of \(\textsf{VE}^{\mathsf F}(\varGamma , (v_1,\dots , v_h))\) follows by induction on the length h of the sequence \((v_1,\dots , v_h)\), using Prop. 2:
Proposition 4
(Soundness). We have: \(\bigodot \textsf{VE}^{\mathsf F}(\varGamma , (v_1,\dots , v_h)) = \sum _{\{v_1,\dots , v_h\}}\bigodot \varGamma \). In particular, \(\textsf{Var} (\textsf{VE}^{\mathsf F}(\varGamma , (v_1,\dots , v_h))) = \textsf{Var} (\varGamma )\setminus \{v_1,\dots , v_h\}\).
Soundness states that the \(\textsf{VE}^{\mathsf F}\) transformation corresponds to summing out the eliminated variables from the product of the factors under consideration. This means that if the factors in \(\varGamma \) represent random variables, then \(\bigodot \textsf{VE}^{\mathsf F}(\varGamma , (v_1,\dots , v_h))\) computes the joint distribution over the variables in \(\textsf{Var} (\varGamma )\setminus \{v_1,\dots , v_h\}\).
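On a toy pair of factors sharing a variable (our own example, with made-up numbers), the soundness equation of Prop. 4 can be checked numerically against the sketches above:
```python
phi = Factor({"x1": BOOL, "x2": BOOL},
             {a: 0.25 for a in web({"x1": BOOL, "x2": BOOL})})
psi = Factor({"x2": BOOL, "x3": BOOL},
             {a: (0.9 if dict(a)["x2"] == dict(a)["x3"] else 0.1)
              for a in web({"x2": BOOL, "x3": BOOL})})

# ⨀ VE^F(Γ, (x1, x2))  ==  Σ_{x1,x2} ⨀ Γ   (Prop. 4)
lhs = big_product(ve_f([phi, psi], ["x1", "x2"]))
rhs = sum_out(big_product([phi, psi]), {"x1", "x2"})
assert set(lhs.domains) == set(rhs.domains) == {"x3"}
assert all(abs(lhs.table[a] - rhs.table[a]) < 1e-9 for a in lhs.table)
```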

4 Variable Elimination \(\textsf{VE}^\mathcal {L}\) as Let-Term Rewriting

This section contains our main contribution: we express the variable elimination algorithm syntactically, as a rewriting of let-terms which transforms the "eliminated" variables from global variables (i.e. defined by a definition of a let-term and accessible to the following definitions) into local variables (i.e. private to some subexpression in a specific definition). Subsect. 4.1 defines such a rewriting \(\rightarrow \) of let-terms (Fig. 6) and states some of its basic properties. Subsect. 4.2 introduces the \(\textsf{VE}^\mathcal {L}\) transformation as a deterministic strategy applying \(\rightarrow \) so as to make the variable to be eliminated local (Def. 8), without changing the denotational semantics of the term (Prop. 6). Theorem 1 and Corollary 1 prove that \(\textsf{VE}^\mathcal {L}\) and \(\textsf{VE}^{\mathsf F}\) are equivalent, showing that \(\textsf{Fs}(\cdot )\) commutes over the two transformations. Finally, Subsect. 4.2 briefly discusses some complexity properties, namely that \(\textsf{VE}^\mathcal {L}\) increases the size of a let-term only moderately, within a linear bound.

4.1 Let-Term Rewriting

Fig. 6.
Let-term rewriting rules. We recall that variables named x are assumed positive and variables named f arrow, while those named v may be positive or arrow. We also recall from Section 2 that \(\boldsymbol{v}^a\) denotes the only arrow variable in a pattern \(\boldsymbol{v}\), if it exists, and \(\boldsymbol{v}^+\) denotes the pattern obtained from \(\boldsymbol{v}\) by removing the arrow variable \(\boldsymbol{v}^a\), if any. In case \(\boldsymbol{v}^+\) is empty, the notation \((\boldsymbol{v}^+, e)\) stands for e.
Fig. 6 gives the rewriting rules of let-terms that we will use in the sequel. The rewriting steps \(\gamma _1,\gamma _2,\gamma _3\) are called swapping and we write \(\ell \xrightarrow {\gamma }\ell '\) whenever \(\ell '\) is obtained from \(\ell \) by applying any such swapping step. The rewriting step \(\mu \) is called multiplicative; it is used to couple two definitions. The reason why \(\gamma _3\) is classified as swapping rather than multiplicative reflects the role of arrow variables in the definition of \(\textsf{Fs}(\ell )\). Finally, the rewriting step \(\epsilon _x\) eliminates a positive variable x from the outermost definitions, provided this variable is not used in the sequel. The conditions in each rule guarantee that the rewriting \(\rightarrow \) preserves typing, as stated by the following proposition.
Proposition 5
(Subject reduction). The rewriting \(\rightarrow \) of Fig. 6 preserves typing, i.e. if \(\ell \rightarrow \ell '\) and \(\ell \) is of type \(T\), then so is \(\ell '\), as well as \(\textsf{FV}(\ell ) =\textsf{FV}(\ell ') \).
Proposition 6
(Semantics invariance). The rewriting \(\rightarrow \) of Fig. 6 preserves the denotational interpretation, i.e. if \(\ell \rightarrow \ell '\) then \(\llbracket \ell \rrbracket =\llbracket \ell '\rrbracket \).
Moreover, \(\textsf{Fs}(\ell )\) is invariant under the swapping rewriting steps:
Lemma 1
If \(\ell \xrightarrow {\gamma }\ell '\), then \(\textsf{Fs}(\ell ')=\textsf{Fs}(\ell )\).

4.2 Variable Elimination Strategy

The \(\textsf{VE}^\mathcal {L}\) transformation can be seen as a deterministic strategy of applying the rewriting \(\rightarrow \) in order to make a variable local in a let-term. The idea of \(\textsf{VE}^\mathcal {L}(\ell ,x)\) is the following: first, we gather the definitions \((\boldsymbol{v}_i = e_i)\) of \(\ell \) having x free in \(e_i\) into a single large definition \(\boldsymbol{v} = e\) and we move the latter close to the definition of x in \(\ell \); then, we make the definition of x local to e. To formalise this rewriting sequence we define two auxiliary transformations: swapping definitions \(\textsf{SD}\) (Def. 6) and variable anticipation \(\textsf{VA}\) (Def. 7).
The swapping definition procedure rewrites a let-term \(\ell \) with at least two definitions by swapping (or gathering) the first definition with the second one, without changing the factor representation.
Definition 6
(Swapping definitions). We define \(\textsf{SD}(\ell )\) for a let-term \(\ell \mathrel {:=}(\boldsymbol{v}_1=e_1, \boldsymbol{v}_2=e_2\, \texttt{in}\, \ell ')\) with at least two definitions. The definition splits into the following cases, depending on how \(e_2\) depends on \(\boldsymbol{v}_1\).
1.
If \(\textsf{FV}(\boldsymbol{v}_1) \cap \textsf{FV}(e_2) = \emptyset \), \( \textsf{SD}(\ell )\mathrel {:=}(\boldsymbol{v}_2=e_2; \boldsymbol{v}_1=e_1\, \texttt{in}\, \ell ')\).
 
2.
If \(\textsf{FV}(\boldsymbol{v}_1) \cap \textsf{FV}(e_2) =\boldsymbol{x}\) is a non-empty sequence of positive variables, \( \textsf{SD}(\ell )\mathrel {:=}(g=\lambda \boldsymbol{x}.e_2; \boldsymbol{v}_1=e_1; \boldsymbol{v}_2=g \boldsymbol{x}\, \texttt{in}\, \ell ')\).
 
3.
If \(\boldsymbol{v}^a_1 = f\) and \(f \in \textsf{FV}(e_2) \), \( \textsf{SD}(\ell )\mathrel {:=}((\boldsymbol{v}^+_1,\boldsymbol{v}_2)= (\boldsymbol{v}_1=e_1\, \texttt{in}\, (\boldsymbol{v}^+_1, e_2)) \, \texttt{in}\, \ell ')\), if \(\boldsymbol{v}^+_1\) is non-empty, otherwise: \(\textsf{SD}(\ell )\mathrel {:=}(\boldsymbol{v}_2=(\boldsymbol{v}_1=e_1\, \texttt{in}\, e_2)\, \texttt{in}\, \ell ')\).
 
Notice that the above cases are exhaustive. In particular, if \(\boldsymbol{v}_1\) has some variables in common with \(\textsf{FV}(e_2) \) then either all such common variables are positive or one of them is an arrow variable f. By case inspection and Lemma 1, we get:
Lemma 2
(\(\textsf{SD}\) soundness). Given a let-term \(\ell \) with at least two definitions, then \(\ell \xrightarrow {\gamma }\textsf{SD}(\ell )\), for the swap reduction \(\gamma \) defined in Fig. 6. In particular, \(\textsf{SD}(\ell )\) is a well-typed let-term having the same type as \(\ell \) and such that \(\textsf{Fs}(\ell )=\textsf{Fs}(\textsf{SD}(\ell ))\).
Given a set of variables \(\mathcal {V}\), the variable anticipation procedure rewrites a let-term \(\ell \) into \(\textsf{VA}(\ell , \mathcal {V})\) by “gathering” in the first position all definitions having free variables in \(\mathcal {V}\) or having arrow variables defined by one of the definitions already “gathered”. This definition is restricted to positive let-terms.
Definition 7
(Variable anticipation). We define a let-term \(\textsf{VA}(\ell , \mathcal {V})\mathrel {:=}(\boldsymbol{v}'=e'\, \texttt{in}\, \ell ')\), given a positive let-term \(\ell \mathrel {:=}(\boldsymbol{v}_1=e_1\, \texttt{in}\, \ell _1)\) with at least one definition and a set of variables \(\mathcal {V}\subseteq \textsf{FV}(\ell ) \) disjoint from the output variables of \(\ell \). The definition is by structural induction on \(\ell \) and splits into the following cases.
1.
If \(\mathcal {V}=\emptyset \), then define: \( \textsf{VA}(\ell , \mathcal {V})\mathrel {:=}\ell \).
 
2.
If \(\mathcal {V}\cap \textsf{FV}(e_1) =\emptyset \), so that \(\mathcal {V}\subseteq \textsf{FV}(\ell _1) \), then define: \( \textsf{VA}(\ell , \mathcal {V})\mathrel {:=}\textsf{SD}((\boldsymbol{v}_1=e_1\, \texttt{in}\, \textsf{VA}(\ell _1, \mathcal {V})))\).
 
3.
If \(\mathcal {V}\cap \textsf{FV}(e_1) \ne \emptyset \) and \(\boldsymbol{v}_1\) is positive, then consider \(\textsf{VA}(\ell _1, \mathcal {V}\cap \textsf{FV}(\ell _1) )\mathrel {:=}(\boldsymbol{v}'=e'\, \texttt{in}\, \ell ')\) and set: \( \textsf{VA}(\ell , \mathcal {V})\mathrel {:=}((\boldsymbol{v}_1,\boldsymbol{v}')=(\boldsymbol{v}_1=e_1\, \texttt{in}\, (\boldsymbol{v}_1,e'))\, \texttt{in}\, \ell ')\).
 
4.
If \(\mathcal {V}\cap \textsf{FV}(e_1) \ne \emptyset \) and \(\boldsymbol{v}^a_1=f\), then notice that, by hypothesis, f does not appear in the output of \(\ell _1\), as \(\ell \) (and hence \(\ell _1\)) is positive. So we can consider \(\textsf{VA}(\ell _1, (\mathcal {V}\cap \textsf{FV}(\ell _1) )\cup \{f\})\mathrel {:=}(\boldsymbol{v}'=e'\, \texttt{in}\, \ell ')\) and define: \( \textsf{VA}(\ell , \mathcal {V})\mathrel {:=}( (\boldsymbol{v}^+_1,\boldsymbol{v}') = (\boldsymbol{v}_1 = e_1\, \texttt{in}\, (\boldsymbol{v}_1^+, e')) \, \texttt{in}\, \ell ')\), if \(\boldsymbol{v}^+_1\) is non-empty, otherwise: \(\textsf{VA}(\ell , \mathcal {V})\mathrel {:=}(\boldsymbol{v}'=(\boldsymbol{v}_1=e_1\, \texttt{in}\, e')\, \texttt{in}\, \ell ')\).
 
Finally, we can define the procedure \(\textsf{VE}^\mathcal {L}(\ell , x)\). This procedure consists of three steps: (i) it uses \(\textsf{VA}\) to gather in a unique definition all expressions having a free occurrence of x, or a free occurrence of an arrow variable depending on x; then (ii) it performs \(\mu \) and \(\epsilon \) rewriting steps so as to make x local to a definition; finally (iii), it uses \(\textsf{SD}\) to move the obtained definition to the first position of the let-term. This last step is not strictly necessary, but it is convenient in order to avoid free arrow variables in the expression keeping x local, thus yielding a simple representation of the factor obtained after the "elimination" of x.
Definition 8
(Variable elimination strategy). The let-term \(\textsf{VE}^\mathcal {L}(\ell , x)\) is defined from a positive let-term \(\ell \mathrel {:=}\texttt{let}\, \boldsymbol{v}_1\, =\, e_1\, \texttt{in}\, \ell _1\) and a positive variable x defined in \(\ell \) but not in the output of \(\ell \). The definition is by induction on \(\ell \) and splits into the following cases.
1.
If \(x\in \textsf{FV}(\boldsymbol{v}_1) \) and \(x\notin \textsf{FV}(\ell _1) \), then write \(\boldsymbol{v}_1'\) for the pattern obtained from \(\boldsymbol{v}_1\) by removing x and define: \( \textsf{VE}^\mathcal {L}(\ell , x) \mathrel {:=}(\boldsymbol{v}_1'=(\boldsymbol{v}_1=e_1\, \texttt{in}\, \boldsymbol{v}_1')\, \texttt{in}\, \ell _1)\).
 
2.
If \(x\in \textsf{FV}(\boldsymbol{v}_1) \) and \(x\in \textsf{FV}(\ell _1) \), then write \(\boldsymbol{v}_1'\) for the pattern obtained from \(\boldsymbol{v}_1\) by removing x. Remark that \(\ell _1\) has at least one definition, as x is not in the output of \(\ell _1\). We split into two subcases:
1.
if \(\boldsymbol{v}_1'\) is positive, then set \((\boldsymbol{v}'=e'\, \texttt{in}\, \ell ')\mathrel {:=}\textsf{VA}(\ell _1,\{x\})\) and define: \( \textsf{VE}^\mathcal {L}(\ell , x) \mathrel {:=}((\boldsymbol{v}_1',\boldsymbol{v}')=(\boldsymbol{v}_1=e_1\, \texttt{in}\, (\boldsymbol{v}_1', e'))\, \texttt{in}\, \ell ')\).
 
2.
if \((\boldsymbol{v}'_1)^a=f\), then set \((\boldsymbol{v}'=e'\, \texttt{in}\, \ell ')\mathrel {:=}\textsf{VA}(\ell _1,\{x,f\})\) and define: \( \textsf{VE}^\mathcal {L}(\ell , x) \mathrel {:=}((\boldsymbol{v}_1^+,\boldsymbol{v}')=(\boldsymbol{v}_1=e_1\, \texttt{in}\, (\boldsymbol{v}_1^+, e'))\, \texttt{in}\, \ell ')\).
 
In both sub-cases, if \(\boldsymbol{v}_1'^+\) is empty, we set \(\textsf{VE}^\mathcal {L}(\ell , x) \mathrel {:=}(\boldsymbol{v}'=(\boldsymbol{v}_1=e_1\, \texttt{in}\, e')\, \texttt{in}\, \ell ')\).
 
3.
If \(x\notin \textsf{FV}( \boldsymbol{v}_1) \), then x is defined in \(\ell _1\), and we can set: \( \textsf{VE}^\mathcal {L}(\ell , x) \mathrel {:=}\textsf{SD}((\boldsymbol{v}_1=e_1\, \texttt{in}\, \textsf{VE}^\mathcal {L}(\ell _1, x)))\).
 
As for \(\textsf{VE}^{\mathsf F}\), we extend \(\textsf{VE}^\mathcal {L}\) to sequences of (positive) variables, by
$$ \textsf{VE}^\mathcal {L}(\ell , (x_1,\dots , x_h))\mathrel {:=}\textsf{VE}^\mathcal {L}(\textsf{VE}^\mathcal {L}(\ell , x_1), (x_2,\dots , x_h))\,, $$
with \(\textsf{VE}^\mathcal {L}(\ell , ())\mathrel {:=}\ell \) for the empty sequence.
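To convey the shape of the strategy without the pattern and arrow-variable bookkeeping, here is a deliberately naive sketch on a toy AST in which every definition binds a single positive variable, and the definitions using x are assumed to depend only on variables defined before x (assumptions that \(\textsf{VA}\) lifts in the real definition). Under these assumptions, eliminating x amounts to nesting its definition together with its users into one local let-block:
```python
def fv(expr):
    """Free variables of a toy expression ("app", matrix_name, [variables])."""
    return set(expr[2])

def ve_l_toy(defs, x):
    """Make x local: defs is a list of (variable, expression) pairs.
    Returns new defs in which x only occurs inside one nested let-block."""
    i = next(k for k, (v, _) in enumerate(defs) if v == x)
    users  = [(v, e) for (v, e) in defs[i+1:] if x in fv(e)]
    others = [(v, e) for (v, e) in defs[i+1:] if x not in fv(e)]
    pattern = tuple(v for v, _ in users)          # the merged output pattern
    local = ("let", [defs[i]] + users, pattern)   # x is local to this block
    return defs[:i] + [(pattern, local)] + others

defs = [("x1", ("app", "M1", [])),
        ("x2", ("app", "M2", ["x1"])),
        ("x3", ("app", "M3", ["x1", "x2"]))]
# ve_l_toy(defs, "x1") yields the single definition
#   (x2, x3) = (x1 = M1(), x2 = M2(x1), x3 = M3(x1, x2) in (x2, x3)),
# mirroring the shape of the rewriting in Fig. 7.
```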
Fig. 7.
Rewriting of \(\ell \) into \(\textsf{VE}^\mathcal {L}(\ell , (x_1, x_2, x_4, x_5))=\ell '\) for \(\ell ,\ell '\) given in Ex. 2. We underline the fired redex of each reduction step. We also name \(e_1\), \(e_2\), \(e_4\), \(e_5\) the expressions keeping the corresponding variable local (i.e. \(e_i\) keeps \(x_i\) local).
Example 12
Consider Fig. 7 and denote by \(\ell _i\) the let-term in line (Li). This figure details the rewriting sequence of the term \(\ell _1\) into \(\ell _{15}=\textsf{VE}^\mathcal {L}(\ell _1, (x_1, x_2, x_4, x_5))\). Namely, \(\ell _3=\textsf{VE}^\mathcal {L}(\ell _1, x_1)\), \(\ell _8=\textsf{VE}^\mathcal {L}(\ell _3, x_2)\), \(\ell _{11}=\textsf{VE}^\mathcal {L}(\ell _8, x_4)\), \(\ell _{15}=\textsf{VE}^\mathcal {L}(\ell _{11}, x_5)\).
Proposition 7
(Rewriting into \(\textsf{VE}^\mathcal {L}\)). Let \(\ell \) be a let-term with n definitions: \(\textsf{VE}^\mathcal {L}(\ell , x)\) is obtained from \(\ell \) by at most n steps of the \(\rightarrow \) rewriting of Fig. 6. In particular, \(\textsf{VE}^\mathcal {L}(\ell , x)\) has the same type and free variables as \(\ell \).
The following theorem states both the soundness and completeness of our syntactic definition of \(\textsf{VE}^\mathcal {L}\) with respect to the more standard version defined on factors. Soundness holds because each syntactic elimination of a variable corresponds, modulo the map \(\textsf{Fs}(\cdot )\), to the elimination \(\textsf{VE}^{\mathsf F}\) on factors. Completeness holds because this is true for any chosen variable, so all variable elimination sequences can be simulated in the syntax (Corollary 1).
Theorem 1
Given \(\ell \) and x as in Def. 8, we have: \( \textsf{Fs}(\textsf{VE}^\mathcal {L}(\ell , x)) = \textsf{VE}^{\mathsf F}(\textsf{Fs}(\ell ), x)\).
From Theorem 1 and Defs. 5 and 8, the following is immediate.
Corollary 1
Given a let-term \(\ell \) with all output variables positive and given a sequence \((x_1,\dots ,x_n)\) of positive variables defined in \(\ell \) and not appearing in the output of \(\ell \), we have that: \( \textsf{VE}^{\mathsf F}(\textsf{Fs}(\ell ),(x_1,\dots , x_n)) = \textsf{Fs}(\textsf{VE}^\mathcal {L}(\ell , (x_1,\dots , x_n))) \).
Recall from Ex. 2 that Bayesian networks can be represented by let-terms, so the above result shows that \(\textsf{VE}^\mathcal {L}\) implements in \(\mathcal {L}\) the elimination of a set of random variables of a Bayesian network in any possible order. It is well-known that the variable elimination algorithm may produce intermediate factors that are not stochastic matrices. The standard literature on probabilistic graphical models refers to the intermediate factors simply as vectors of non-negative real numbers, missing any finer characterisation. We stress that our setting allows for a more precise characterisation of such factors, as they are represented by well-typed terms of \(\mathcal {L}\): not every vector of non-negative real numbers fits in. In particular, the typing system suggests a hierarchy of the complexity of a factor that, by recalling Remark 1, can be summarised by the alternation between direct sums \(\oplus \) and products &: the simplest factors have type \(\bigoplus _n1\), i.e. probabilistic distributions over n values, then we have those of type \(\mathbin {\& }_m\bigoplus _n1\), i.e. stochastic matrices describing probabilities over n values conditioned on observations over m values, then we have more complex factors of type \(\bigoplus _k\mathbin {\& }_m\bigoplus _n1\), i.e. probabilistic distributions over stochastic matrices, and so forth.
Complexity Analysis. Prop. 7 gives a bound on the number of \(\rightarrow \) steps needed to rewrite \(\ell \) into \(\textsf{VE}^\mathcal {L}(\ell , x)\); however, some of these steps add new definitions to the rewritten let-term. The size of \(\textsf{VE}^\mathcal {L}(\ell , x)\), although in general greater than that of \(\ell \), stays moderate: it admits an upper bound linear in the degree of \(\textsf{Fs}(\ell )_x\) (Prop. 8). We define the size of an expression as follows:
$$\begin{aligned} \textsf{s}(v)&\mathrel {:=}1 &\textsf{s}(\lambda \boldsymbol{v}.e)&\mathrel {:=}\textsf{s}(\boldsymbol{v}) + \textsf{s}(e) &\textsf{s}((e,e'))&\mathrel {:=}\textsf{s}(e)+\textsf{s}(e') \\ \textsf{s}(f \boldsymbol{x})&\mathrel {:=}1+\textsf{s}(\boldsymbol{x}) &\textsf{s}(\mathtt M(\boldsymbol{x}))&\mathrel {:=}1+\textsf{s}(\boldsymbol{x}) &\textsf{s}((\boldsymbol{v} = e\, \texttt{in}\, e'))&\mathrel {:=}\textsf{s}(\boldsymbol{v}) + \textsf{s}(e) + \textsf{s}(e') \end{aligned}$$
By induction on \(\ell \), we obtain the following:
Proposition 8
Given a let-term \(\ell \) and a positive variable x as in Def. 8, we have that \(\textsf{s}(\textsf{VE}^\mathcal {L}(\ell , x))\le \textsf{s}(\ell ) + 4 \times \textsf{s}(\textsf{Var} (\textsf{Fs}(\ell )_x)\setminus \textsf{FV}(\ell ) )\).
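For completeness, the size function transcribes one-to-one over a hypothetical tagged-tuple encoding of the grammar (our own encoding, for illustration only):
```python
def size(t):
    """Size s(·) of patterns, expressions and let-terms, following the
    equations above. Nodes: variables ("var", name), abstractions
    ("lam", v, e), pairs ("pair", e, e'), applications ("app", f, args),
    matrices ("mat", M, args) and definitions ("let", v, e, body)."""
    tag = t[0]
    if tag == "var":                    # s(v) = 1
        return 1
    if tag in ("lam", "pair"):          # s(λv.e) = s(v)+s(e); s((e,e')) = s(e)+s(e')
        return size(t[1]) + size(t[2])
    if tag in ("app", "mat"):           # s(f x̄) = s(M(x̄)) = 1 + s(x̄)
        return 1 + sum(size(a) for a in t[2])
    if tag == "let":                    # s((v = e in e')) = s(v)+s(e)+s(e')
        return size(t[1]) + size(t[2]) + size(t[3])
    raise ValueError(f"unknown node {tag!r}")
```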

5 Conclusions and discussion

We have identified a fragment \(\mathcal {L}\) of the linear simply-typed \(\lambda \)-calculus which can express syntactically any factorisation induced by a run of the variable elimination algorithm over a Bayesian network. In particular, we define a rewriting (Fig. 6) and a reduction strategy \(\textsf{VE}^\mathcal {L}\) (Def. 8) that, given a sequence \((x_1,\dots , x_n)\) of variables to eliminate, transforms in \(O(n\textsf{s}(\ell ))\) steps a let-term \(\ell \) into a let-term \(\textsf{VE}^\mathcal {L}(\ell , (x_1,\dots , x_n))\) associated with the factorisation generated by the elimination of \((x_1,\dots , x_n)\) (Corollary 1). We have proven that the size of \(\textsf{VE}^\mathcal {L}(\ell , (x_1,\dots , x_n))\) is linear in the size of \(\ell \) and in the number of variables involved in the elimination process (Prop. 8).
Our language is a fragment of a more expressive one [15], in which several classes of stochastic models can be encoded. Our work is therefore a step towards defining standard exact inference algorithms on a general-purpose stochastic language, as first propounded in [27], with the goal of having general-purpose algorithms of reasonable cost, usable on any model expressed in the language.
While it is known (see [26], Sect. 9.3.1.3) that \(\textsf{VE}\) produces intermediate factors that are not conditional probabilities—i.e. not stochastic matrices—our approach is able to associate a term and a type with such factors. In fact, the types of the calculus \(\mathcal {L}\) give a logical description of the interdependences between the factors generated by the \(\textsf{VE}\) algorithm: the grammar is more expressive than just the types of stochastic matrices between tuples of booleans (Remark 1).
Discussion and perspectives. Since our approach is theoretical, the main goal has been to give a formal framework for proving the soundness and the completeness of \(\textsf{VE}^\mathcal {L}\). For that sake, the rewriting rules of Fig. 6 are reduced to a minimum, in order to keep the number of cases in the proofs reasonable. The drawback is that the rewritten terms contain a lot of bureaucratic code, as the reader may realise by looking at Fig. 7. Although this fact is not crucial from the point of view of the asymptotic complexity, when aiming at a prototypical implementation one may enrich the rewriting system with more rules so as to avoid useless code.
The grammar of let-terms recalls the notion of administrative normal form (abbreviated ANF), which is often used as an intermediate representation in the compilation of functional programs. In particular, let-terms and ANF share the restriction of applications to variables, thereby suggesting a precise evaluation order. Several optimisations are defined as transformations over ANF, including let-floating rules analogous to the ones considered in Fig. 6, see e.g. [38]. Comparing these optimisations is not trivial, as the cost models differ: e.g. [38] aims to reduce heap allocations, while here we are factoring algebraic expressions to minimise floating-point operations. We plan to investigate in more detail the possible interplay/interference between these techniques.
The quest for optimal factorisations is central not only to Bayesian programming. In particular, these techniques can be applied to large fragments of \(\lambda \)-calculus, suggesting heuristics for making tractable the computation of the quantitative semantics of classes of \(\lambda \)-terms other than the one identified by \(\mathcal {L}\). This is of great interest in particular because these semantics are relevant in describing quantitative observational equivalences, as hinted for example by the full-abstraction results achieved in probabilistic programming, e.g. [6, 13, 14].
Finally, while we have stressed that our work is theoretical, we do not mean to say that foundational understanding in general, and this work in particular, is irrelevant to the practice. Let us mention one such perspective. Factored inference is central to inference in graphical models, but scaling it up to the more complex problems expressible as probabilistic programs proves difficult—research in this direction is beginning, mainly guided by implementation techniques [21, 32, 40]. We believe that a foundational understanding of factorisation on the structure of the program—starting from the most elementary algorithms, as we do here—is also important to allow progress in this direction.
On dealing with evidence. We have focused on the computation of marginals, without explicitly treating posteriors. Our approach could easily be adapted to deal with evidence (hence, posteriors) by extending the syntax and rewriting rules with an observe construct, as in [21] or in [16].

Acknowledgments

We are deeply grateful to Marco Gaboardi for suggesting the investigation of the link between variable elimination and linear logic, as well as to Robin Lemaire, with whom we initiated this research. We also thank the anonymous referees for their many valuable suggestions to improve the paper.
This work has been supported by the ANR grant PPS ANR-19-CE48-0014 and by an ENS de Lyon research grant.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Appendix

Electronic supplementary material

Below is the link to the electronic supplementary material.
Footnotes
1
The results trivially extend to random variables over countable sets of outcomes.
 
2
Actually, the expressions \(e^1\) and \(e^2\) in Fig. 7 are a bit more cumbersome than the ones discussed here, because of some bureaucratic let-ins produced by the more formal treatment. This difference is inessential and can be removed by a post-processing step.
 
References
3.
Chavira, M., Darwiche, A.: Compiling Bayesian networks with local structure. In: Kaelbling, L.P., Saffiotti, A. (eds.) IJCAI-05, Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence, Edinburgh, Scotland, UK, July 30 - August 5, 2005, pp. 1306–1312. Professional Book Center (2005), http://ijcai.org/Proceedings/05/Papers/0931.pdf
7.
Danos, V., Ehrhard, T.: Probabilistic coherence spaces as a model of higher-order probabilistic computation. Information and Computation 209(6), 966–991 (2011)
13.
Ehrhard, T., Pagani, M., Tasson, C.: Probabilistic Coherence Spaces are Fully Abstract for Probabilistic PCF. In: Sewell, P. (ed.) The 41st Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2014, San Diego, USA. ACM (2014)
18.
Girard, J.Y.: Linear logic. Theor. Comput. Sci. 50, 1–102 (1987)
20.
Hindley, J.R., Seldin, J.P.: Lambda-Calculus and Combinators: An Introduction. Cambridge University Press, USA, 2nd edn. (2008)
23.
25.
Jacobs, B., Zanasi, F.: The logical essentials of Bayesian reasoning. In: Foundations of Probabilistic Programming, pp. 295–332. Cambridge University Press (2020)
26.
Koller, D., Friedman, N.: Probabilistic Graphical Models: Principles and Techniques. The MIT Press (2009)
27.
Koller, D., McAllester, D.A., Pfeffer, A.: Effective Bayesian inference for stochastic programs. In: Kuipers, B., Webber, B.L. (eds.) Proceedings of the Fourteenth National Conference on Artificial Intelligence and Ninth Innovative Applications of Artificial Intelligence Conference, AAAI 97, IAAI 97, July 27-31, 1997, Providence, Rhode Island, USA, pp. 740–747. AAAI Press / The MIT Press (1997), http://www.aaai.org/Library/AAAI/1997/aaai97-115.php
29.
Laird, J., Manzonetto, G., McCusker, G., Pagani, M.: Weighted relational models of typed lambda-calculi. In: 28th Annual ACM/IEEE Symposium on Logic in Computer Science, LICS 2013, New Orleans, LA, USA, June 25-28, 2013. IEEE Computer Society (2013)
35.
Obermeyer, F., Bingham, E., Jankowiak, M., Pradhan, N., Chiu, J.T., Rush, A.M., Goodman, N.D.: Tensor variable elimination for plated factor graphs. In: Chaudhuri, K., Salakhutdinov, R. (eds.) Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA. Proceedings of Machine Learning Research, vol. 97, pp. 4871–4880. PMLR (2019), http://proceedings.mlr.press/v97/obermeyer19a.html
36.
Paquet, H.: Bayesian strategies: probabilistic programs as generalised graphical models. In: Yoshida, N. (ed.) Programming Languages and Systems - 30th European Symposium on Programming, ESOP 2021, Held as Part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2021, Luxembourg City, Luxembourg, March 27 - April 1, 2021, Proceedings. Lecture Notes in Computer Science, vol. 12648, pp. 519–547. Springer (2021). https://doi.org/10.1007/978-3-030-72019-3_19
37.
Pearl, J.: Probabilistic reasoning in intelligent systems - networks of plausible inference. Morgan Kaufmann series in representation and reasoning, Morgan Kaufmann (1989)
39.
Pfeffer, A.: The design and implementation of IBAL: a general-purpose probabilistic language. Introduction to Statistical Relational Learning, p. 399 (2019)
40.
Pfeffer, A., Ruttenberg, B.E., Kretschmer, W., O'Connor, A.: Structured factored inference for probabilistic programming. In: Storkey, A.J., Pérez-Cruz, F. (eds.) International Conference on Artificial Intelligence and Statistics, AISTATS 2018, 9-11 April 2018, Playa Blanca, Lanzarote, Canary Islands, Spain. Proceedings of Machine Learning Research, vol. 84, pp. 1224–1232. PMLR (2018), http://proceedings.mlr.press/v84/pfeffer18a.html
41.
Sabry, A., Wadler, P.: A reflection on call-by-value. ACM Trans. Program. Lang. Syst. 19(6), 916–941 (1997)
43.
Stein, D., Staton, S.: Compositional semantics for probabilistic programs with exact conditioning. In: 2021 36th Annual ACM/IEEE Symposium on Logic in Computer Science (LICS), pp. 1–13 (2021)
45.
Zhang, N., Poole, D.: A simple approach to Bayesian network computations. In: Proceedings of the 10th Biennial Canadian Artificial Intelligence Conference, pp. 171–178. AAAI Press / The MIT Press (1994)
46.
Zhou, Y., Yang, H., Teh, Y.W., Rainforth, T.: Divide, conquer, and combine: a new inference strategy for probabilistic programs with stochastic support. In: Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event. Proceedings of Machine Learning Research, vol. 119, pp. 11534–11545. PMLR (2020), http://proceedings.mlr.press/v119/zhou20e.html