6.2.1 Theoretical properties of TMFBS
We proceed with the theoretical properties of TMFBS. Let
\({\mathcal {A}}\) be an admissible pair of functions
\(\langle \textsc {OrderVariables}, \textsc {BackwardPhase} \rangle \). We show that TMFBS identifies all equivalent solutions if the following assumption holds.
It is important to notice that Assumption
1 refers to
any dataset
\({\mathcal {D}}\), and therefore implies that TFBS using
\({\mathcal {A}}\) can also find a Markov blanket of
T in any embedded dataset in
\({\mathcal {D}}\) (or any dataset which contains
\({\mathcal {D}}\)); however, note that a Markov blanket in an embedded dataset is not necessarily a Markov blanket in the original
\({\mathcal {D}}\).
Depending on
\({\mathcal {A}}\), the distributions for which Assumption
1 holds differ. For example, for FBS it has been shown that Assumption
1 holds (assuming an oracle for testing conditional independence) for distributions that satisfy the local composition property (Statnikov et al.
2013), i.e., if
\(T {\perp } {\mathbf {X}} \ | \ {\mathbf {Z}} \wedge T {\perp } {\mathbf {Y}} \ | \ {\mathbf {Z}} \Rightarrow T {\perp } {\mathbf {X}} \cup {\mathbf {Y}} \ | \ {\mathbf {Z}}\) holds for any sets
\({\mathbf {X}}, {\mathbf {Y}}\) and
\({\mathbf {Z}}\). Conditions under which HITON-PC (Aliferis et al.
2003) identifies all solutions are given in (Statnikov et al.
2013). We note that there is no general recipe to identify for which distributions Assumption
1 holds for an arbitrary
\({\mathcal {A}}\). However, for algorithms that can be connected to probabilistic graphical models (such as FBS and HITON-PC), one can use the theory of probabilistic graphical models as a guide to find conditions under which Assumption 1 holds (see Statnikov et al.
2013). We proceed with the main result.
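To make the local composition property concrete, the following sketch builds a small factorized distribution in which both premises of the implication hold and verifies that the composed independence statement holds as well. This is purely illustrative Python, not part of TMFBS; the toy distribution, variable names, and the exact conditional-independence oracle are our assumptions.

```python
from itertools import product

# Toy factorized distribution over binary T, X, Y, Z (illustrative only):
# P(t,x,y,z) = P(z) P(t|z) P(x) P(y).  By construction T depends only on Z,
# so T ⊥ X | Z and T ⊥ Y | Z hold, and local composition gives T ⊥ {X,Y} | Z.
pz = {0: 0.4, 1: 0.6}
pt_z = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8}
px = {0: 0.5, 1: 0.5}
py = {0: 0.1, 1: 0.9}

DOMAIN = {"T": (0, 1), "X": (0, 1), "Y": (0, 1), "Z": (0, 1)}

def joint(t, x, y, z):
    return pz[z] * pt_z[(t, z)] * px[x] * py[y]

def prob(fixed):
    """Marginal probability of the partial assignment `fixed`."""
    total = 0.0
    for t, x, y, z in product((0, 1), repeat=4):
        a = {"T": t, "X": x, "Y": y, "Z": z}
        if all(a[v] == val for v, val in fixed.items()):
            total += joint(t, x, y, z)
    return total

def indep(A, B, C):
    """Exact CI oracle on the toy distribution: does A ⊥ B | C hold?
    Checks P(a, b | c) = P(a | c) P(b | c) for every assignment."""
    for c in product(*(DOMAIN[v] for v in C)):
        cond = dict(zip(C, c))
        pc = prob(cond)
        if pc == 0.0:
            continue
        for a in product(*(DOMAIN[v] for v in A)):
            for b in product(*(DOMAIN[v] for v in B)):
                pa = prob({**cond, **dict(zip(A, a))}) / pc
                pb = prob({**cond, **dict(zip(B, b))}) / pc
                pab = prob({**cond, **dict(zip(A, a)),
                            **dict(zip(B, b))}) / pc
                if abs(pab - pa * pb) > 1e-12:
                    return False
    return True

# Both premises hold, and so does the composed statement:
assert indep(["T"], ["X"], ["Z"]) and indep(["T"], ["Y"], ["Z"])
assert indep(["T"], ["X", "Y"], ["Z"])
```

Note that composition does not hold in every distribution (e.g., a parity relation between T, X, and Y given Z violates it), which is exactly why Assumption 1 restricts the class of distributions for a given \({\mathcal {A}}\).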
6.2.2 Sound rules for pruning the search space
In this section we introduce several sound rules for speeding up TMFBS, which are direct consequences of Theorem
1, and prove their correctness.
Theorem
1 states that TMFBS will find all solutions in any
\({\mathcal {D}}\), and therefore also in any embedded dataset
\({\mathcal {D}}^{{\mathbf {E}}}\) in
\({\mathcal {D}}\). An immediate consequence of this is that, if no solution is found in some dataset
\({\mathcal {D}}^{{\mathbf {E}}}\), then no solution can be found in any dataset
\({\mathcal {D}}^{{\mathbf {E}} \cup \mathbf {E'}}\) embedded in
\({\mathcal {D}}^{{\mathbf {E}}}\); if there were a solution in
\({\mathcal {D}}^{{\mathbf {E}} \cup \mathbf {E'}}\), it would also be contained in
\({\mathcal {D}}^{{\mathbf {E}}}\), as
\({\mathcal {D}}^{{\mathbf {E}} \cup \mathbf {E'}}\) is embedded in
\({\mathcal {D}}^{{\mathbf {E}}}\).
Based on this, we propose and incorporate the following rules in TMFBS for pruning the search space and speeding it up.
In TMFBS, Rule
2 is checked before the recursive call to
TMFBS(
\({\mathcal {D}}^{C_{1:i-1}}\),
\({\mathbf {S}} \cup \{C_{i}\}\)) (Line
14 in Algorithm 3), while Rule
1 is checked afterwards. We note that Rule
2 is one of the conditions of the IGS procedure used by TIE* (see fourth bullet and step 1 of Figure 9 in Statnikov et al.
2013).
Recall that, after including a variable
\(C_i\) in state
\(t_i\), none of its subsequent siblings
\(t_{i+1}, \ldots , t_k\) will consider
\(C_i\) again, and thus increasingly smaller embedded datasets are explored. Thus, if
TMFBS(
\({\mathcal {D}}^{C_{1:i-1}}\),
\({\mathbf {S}} \cup \{C_{i}\}\)) does not lead to a solution, by Corollary
1 neither can any call to TMFBS with datasets embedded in
\({\mathcal {D}}^{C_{1:i-1}}\). Although Rule
1 identifies many of the prunable cases implied by Corollary
1, it does not necessarily identify all of them. Combining it with Rule
2, which is essentially a direct application of Corollary
1, ensures that all such cases are pruned. We note that Rule
1 is theoretically not required; however, in contrast to Rule
2, which requires keeping track of all embedded datasets that did not lead to any solution, Rule
1 can be implemented trivially and efficiently.
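The bookkeeping behind Rule 2 can be sketched as follows. This is a minimal illustration, not the paper's pseudocode: we assume an embedded dataset is identified by its set of excluded variables, and all names are ours.

```python
# Rule 2 sketch (illustrative): remember every exclusion set E for which
# the embedded dataset D^E contained no solution.  By Corollary 1, any
# dataset D^{E ∪ E'} is embedded in D^E, so if some failed E is a subset
# of the current exclusion set, the current dataset can be pruned.
failed_exclusions = []  # exclusion sets with no solution

def record_failure(excluded):
    """Call when TMFBS finds no solution in the dataset excluding `excluded`."""
    failed_exclusions.append(frozenset(excluded))

def rule2_prunes(excluded):
    """True if some previously failed exclusion set is a subset of
    `excluded`, i.e., the current embedded dataset cannot contain a
    solution and the recursive call can be skipped."""
    excluded = frozenset(excluded)
    return any(e <= excluded for e in failed_exclusions)

record_failure({"C3"})             # no solution once C3 is excluded
assert rule2_prunes({"C3", "C7"})  # excludes even more: prune
assert not rule2_prunes({"C7"})    # C3 still available: must search
```

The subset scan makes the trade-off mentioned above visible: Rule 2 must store and scan all failed exclusion sets, whereas Rule 1 needs no such bookkeeping.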
The previous rules only consider the forward phase of TMFBS. We identified another pruning rule, which concerns the backward phase of the algorithm.
Rule
3 is checked after the recursive call to
TMFBS(
\({\mathcal {D}}^{C_{1:i-1}}\),
\({\mathbf {S}} \cup \{C_{i}\}\)) (after Line
14 in Algorithm 3).
Pruning Rule
3 covers cases where variable
\(C_i\) is selected at some step but is not in any solution in
\({\mathcal {D}}^{C_{1:i-1}}\), which can happen if
\(C_i\) is removed during the backward phase from all of them. Basically, if including
\(C_i\) leads to some solutions (i.e., Rule
1 does not apply), but none of them actually contains
\(C_i\),
\(C_i\) was a false positive which got removed by the backward phase. Thus, the call
TMFBS(
\({\mathcal {D}}^{C_{1:i-1}}\),
\({\mathbf {S}} \cup \{C_{i}\}\)) would give the same results as
TMFBS(
\({\mathcal {D}}^{C_{1:i}}\),
\({\mathbf {S}}\)), which is the same as the results of all remaining states in the current recursive call (i.e.,
TMFBS(
\({\mathcal {D}}^{C_{1:i}}\),
\({\mathbf {S}} \cup \{C_{i+1}\}\)),
\(\ldots \),
TMFBS(
\({\mathcal {D}}^{C_{1:k-1}}\),
\({\mathbf {S}} \cup \{C_{k}\}\))). Because of that, there is no reason to consider the remaining candidates
\(C_{i+1:k}\), which have already been implicitly considered, and
\({\mathbf {M}}\) can be returned.
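The check behind Rule 3 can be sketched as follows. This is an illustrative fragment under our own assumptions: solutions are represented as sets of variable names, and the function name is not from the paper.

```python
# Rule 3 sketch (illustrative): after the recursive call that includes
# C_i returns, if solutions were found (so Rule 1 does not apply) but
# none of them contains C_i, then C_i was a false positive removed by
# the backward phase.  The remaining sibling states would reproduce the
# same results, so they can be skipped and M returned.
def rule3_skips_siblings(solutions, c_i):
    return bool(solutions) and all(c_i not in s for s in solutions)

assert rule3_skips_siblings([{"A", "B"}], "C")      # C in no solution: skip
assert not rule3_skips_siblings([], "C")            # no solutions: Rule 1 case
assert not rule3_skips_siblings([{"A", "C"}], "C")  # C is used: keep searching
```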
Despite all attempts to speed up TMFBS, the number of candidate solutions may still be exponential in the number of variables, and thus TMFBS may not terminate in a reasonable time frame.
This is not a weakness of TMFBS, but an inherent property of the problem. Thus, to avoid such cases in practice, we recommend setting a limit on the number of candidate states to consider, the number of solutions to return, or a combination of both. This problem also motivated us to develop an algorithm for summarizing and visualizing multiple solutions, presented in Sect.
7.
6.2.3 Computational complexity
In this section we present an analysis of the complexity of TMFBS. Due to the nature and inherent intractability of the problem, it is hard to quantify the computational complexity of TMFBS accurately and in a useful manner. The complexity depends on many factors, such as (a) the cost of OrderVariables and BackwardPhase for the specific instantiation of TMFBS, (b) the cost of testing whether a candidate solution is equivalent to the reference solution using Equivalent, (c) the search strategy of TMFBS induced by OrderVariables, which affects when TMFBS encounters equivalent solutions, (d) the total number of solutions, (e) the size of solutions, (f) whether the pruning rules trigger or not, and (g) the properties of the input dataset (e.g., number of samples and variables). To simplify the analysis, we make the following assumptions.
First, the costs of
BackwardPhase and
Equivalent are negligible compared to the cost of the forward phase, which is dominated by
OrderVariables. We use
f(
n) to denote the time taken by
OrderVariables to rank
n variables. To make things even simpler, we will implicitly assume that
n is set to the total number of variables in the input dataset, and will simply refer to the cost as
f. Also, in practice the backward phase typically removes only a small number of variables, compared to the ones selected during the forward phase (see also Section 4.4 in (Statnikov et al.
2013)). Given the above, and because pruning rule
3 will trigger for variables which were removed during the backward phase, avoiding further search,
we will only focus on variables which were added during the forward phase and are not removed afterwards. Next,
we will ignore any pruning, unless stated otherwise, as this depends on
OrderVariables and the search space it produces. Finally, we will use an upper bound on the size of the candidate solutions, denoted as
L, allowing us to focus on the worst case scenario. We proceed with the analysis.
Finding the next candidate solution. An interesting and useful way to look at the problem is to measure the time required for identifying a new candidate solution, given the previous candidate solution, at a search tree node N. The exact cost depends on the location of N in the search tree. First, note that after backtracking the algorithm will continue from some node on the path from the root of the search tree to N. Each such node already contains a set of selected variables, with the size depending on the position of the node on the path (e.g., the i-th node already has i selected variables). As the maximum size of any candidate solution is L, the cost of finding the next candidate solution is between f and \(L \cdot f\). Specifically, for the i-th node on the path, the cost is \((L-i) \cdot f\).
Proving there are no more solutions. Another interesting aspect is the cost of deciding whether there are additional equivalent solutions. Again, assume that TMFBS arrived at some leaf node N. TMFBS will backtrack from the leaf node N to the root node, one node at a time. In case there are no more solutions, TMFBS will explore at most one candidate solution from each node to some leaf node, as the pruning rules will trigger. As each node on the path from the root to N is considered, the total cost equals \(\sum _{i=1}^{L} i \cdot f = O(L^2 \cdot f)\). Using this result, one can show that, in case of a single solution, the total cost equals the cost of finding the reference solution (\(O(L \cdot f)\)) plus the cost of proving there are no more solutions, i.e., \(O(L^2 \cdot f)\).
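For completeness, the quadratic bound is just the arithmetic series over the per-node costs: summing the cost \(i \cdot f\) of the i-th node over the at most \(L\) nodes on the path gives \(\sum _{i=1}^{L} i \cdot f = f \cdot \frac{L(L+1)}{2} = O(L^{2} \cdot f)\).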