1 Introduction
- We implement proof search based on clausal and nonclausal connection tableaux calculi in functional programming languages, improving performance upon previous Prolog-based implementations, see Sect. 3.
- We show a method to order proof search steps by using a Naive Bayes classifier based on previous proofs, see Sect. 4.
- We use Monte Carlo Tree Search to guide connection proof search, see Sect. 5. To this end, we propose and evaluate several proof state evaluation heuristics, including two that learn from previous proofs.
2 Connection Calculi
3 Functional-Style Connection Prover
3.1 Problem Preprocessing
- For every n-ary function f, the formula \(x_1 = y_1 \rightarrow \dots \rightarrow x_n = y_n \rightarrow f(x_1, \dots , x_n) = f(y_1, \dots , y_n)\) is introduced.
- For every n-ary predicate P, the formula \(x_1 = y_1 \rightarrow \dots \rightarrow x_n = y_n \rightarrow P(x_1, \dots , x_n) \rightarrow P(y_1, \dots , y_n)\) is introduced.
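These axiom schemas can be generated mechanically. The following Python sketch (our illustration, not the prover's preprocessing code; the string rendering of formulas is an assumption for this example) emits the congruence axiom for a given symbol:

```python
# Illustrative sketch: generating the congruence axioms of Sect. 3.1 for
# function and predicate symbols as implication chains. The textual formula
# representation is an assumption of this example.

def congruence_axiom(symbol: str, arity: int, is_predicate: bool) -> str:
    xs = [f"x{i}" for i in range(1, arity + 1)]
    ys = [f"y{i}" for i in range(1, arity + 1)]
    premises = [f"{x} = {y}" for x, y in zip(xs, ys)]
    fx = f"{symbol}({', '.join(xs)})"
    fy = f"{symbol}({', '.join(ys)})"
    if is_predicate:
        # x1 = y1 -> ... -> xn = yn -> P(x1,...,xn) -> P(y1,...,yn)
        parts = premises + [fx, fy]
    else:
        # x1 = y1 -> ... -> xn = yn -> f(x1,...,xn) = f(y1,...,yn)
        parts = premises + [f"{fx} = {fy}"]
    return " -> ".join(parts)
```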
3.2 Consistent Skolemisation
3.3 Connection Search
3.4 Proof Search
3.4.1 Prolog
prove(C, Path, PathLim) succeeds iff there exists a closed proof tree for \(\langle C, M, Path \rangle \) with a maximal path length of PathLim. For this, prove attempts to close the proof tree for the first literal Lit of C in lines 4–9, and if successful, it continues with the remaining clause Cla of C in line 10. Line 4 corresponds to the reduction rule: the branch is closed if the negation of Lit can be unified with a literal on the Path. Lines 6–8 correspond to the extension rule: the contrapositive database explained in Sect. 3.3 is implemented by the predicate lit(L, C), which succeeds iff the matrix contains some clause that can be unified with \(\left\{ L\right\} \cup C\). This is used to obtain some contrapositive Cla1 for the negation of Lit. If the path does not exceed the length limit (line 7), new branches are opened for Cla1 in line 8. If a contrapositive chosen for Lit leads to the proof search getting stuck, the next contrapositive will be tried by Prolog backtracking.
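The control flow described above can be paraphrased compactly. The following Python sketch is a propositional simplification (string literals with a "-" prefix for negation, membership tests instead of unification), not the actual Prolog listing:

```python
# Propositional sketch of the clausal connection search: reduction closes a
# branch against the path, extension opens branches for a contrapositive,
# and iterative deepening bounds the path length. This is an illustration
# under strong simplifications, not the paper's Prolog code.

def negate(lit):
    return lit[1:] if lit.startswith("-") else "-" + lit

def prove(clause, path, path_lim, matrix):
    if not clause:
        return True
    lit, rest = clause[0], clause[1:]
    # Reduction rule: the branch closes if ~lit occurs on the path.
    if negate(lit) in path and prove(rest, path, path_lim, matrix):
        return True
    # Extension rule: try each clause containing ~lit (its contrapositives).
    if len(path) < path_lim:
        for cla in matrix:
            if negate(lit) in cla:
                cla1 = [l for l in cla if l != negate(lit)]
                if prove(cla1, path + [lit], path_lim, matrix) and \
                   prove(rest, path, path_lim, matrix):
                    return True
    return False

def solve(matrix, start, max_lim=10):
    # Iterative deepening over the path length limit.
    return any(prove(start, [], lim, matrix) for lim in range(1, max_lim))
```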
3.4.2 Lazy Lists and Streams
- x & f denotes f x.
- \(\backslash \) x -> y stands for a lambda term \(\lambda x. y\).
- unify sub lit1 lit2 unifies two literals lit1 and lit2 under a substitution sub, returning a new substitution if successful.
- unifyDB sub lit finds all contrapositives in the database which could match the literal lit under the substitution sub. It returns a list of substitution-contrapositive pairs. It corresponds to the lit predicate in the Prolog version.
- mapMaybe f l returns the results of f for the elements of l on which f succeeded.
- concatMap f l maps f over all elements of l and concatenates the resulting list of lists to form a flat list.
- x ++ y is the concatenation of two lists x and y.
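Hypothetical Python analogues of the mapMaybe and concatMap combinators (names mirror the text; the bodies are our assumptions) behave as follows:

```python
# Python renditions of the list combinators used in the lazy-list prover.
# "Failure" of f is modelled by returning None, an assumption of this sketch.

def map_maybe(f, l):
    # Keep the results of f on the elements of l where f succeeds.
    return [r for r in (f(x) for x in l) if r is not None]

def concat_map(f, l):
    # Map f over l and flatten the resulting list of lists.
    return [y for x in l for y in f(x)]
```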
prove \(C\ \hbox {Path}\) lim \(\sigma \) returns a list of substitutions \([\sigma _1, \dots , \sigma _n]\), where every substitution \(\sigma _i\) corresponds to a closed proof tree for \(\langle C, M, \hbox {Path} \rangle \) with a maximal path length smaller than lim, where the global initial substitution is \(\sigma \) and the final substitution is \(\sigma _i\). Similarly to the Prolog version, prove attempts to close the proof tree for the first literal lit of C in lines 4–8, and the resulting substitutions are used to close the proof trees for the remaining clause cla of C in line 9. Line 4 corresponds to the reduction rule, and lines 5–8 correspond to the extension rule. As we use lazy lists / streams, a substitution \(\sigma _i\) is only calculated if proof search failed for all \(\sigma _j\) with \(j < i\).
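The lazy-list behaviour can be mimicked with Python generators. The sketch below is a propositional simplification (unit tuples stand in for substitutions; an assumption of this illustration, not the actual implementation): it yields one value per closed proof tree, so later alternatives are only computed when earlier ones are rejected.

```python
# Generator-based sketch of the lazy-list prove: each yielded value
# corresponds to one closed proof tree. Propositional simplification,
# with "-"-prefixed strings as negated literals.

def negate(lit):
    return lit[1:] if lit.startswith("-") else "-" + lit

def prove(clause, path, lim, matrix):
    if not clause:
        yield ()                         # one closed proof tree found
        return
    lit, rest = clause[0], clause[1:]
    if negate(lit) in path:              # reduction rule
        yield from prove(rest, path, lim, matrix)
    if len(path) < lim:                  # extension rule
        for cla in matrix:
            if negate(lit) in cla:
                cla1 = [l for l in cla if l != negate(lit)]
                for _ in prove(cla1, path + [lit], lim, matrix):
                    yield from prove(rest, path, lim, matrix)
```

Requesting only the first element of the generator mirrors taking the head of the lazy list: no further proof trees are computed unless demanded.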
3.4.3 Continuations
prove \(C\ { Path}\) lim \(\sigma \) alt rem searches for a closed proof tree for \(\langle C, M, Path \rangle \) with a maximal path length smaller than lim under the substitution \(\sigma \). If prove finds such a proof tree, it calls the rem continuation to treat remaining proof obligations (line 1). Otherwise, prove calls the alt continuation to backtrack to an alternative (line 16). The reduce function in lines 3–7 corresponds to the reduction rule, and the extend function in lines 10–15 corresponds to the extension rule. If no more reductions can be performed, extensions are tried (line 8), and if no more extensions can be performed, we backtrack (line 16). Both reduce and extend define a continuation alt1 (lines 4 and 11) to provide a way to backtrack to the current state and pass it to prove (lines 7 and 15). The extend function additionally defines a continuation rem1 (line 14), which serves to continue proof search for the clause cla once a proof for the contrapositive clause cla1 has been found (line 15).
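The continuation discipline can likewise be sketched in Python (a propositional simplification; the alt/alt1/rem/rem1 names follow the text, but the body is our assumption, not the actual implementation):

```python
# Continuation-passing sketch: alt backtracks to an alternative, rem treats
# the remaining proof obligations. Propositional simplification with
# "-"-prefixed strings as negated literals.

def negate(lit):
    return lit[1:] if lit.startswith("-") else "-" + lit

def prove(clause, path, lim, matrix, alt, rem):
    if not clause:
        return rem()                     # obligations for this clause done
    lit, rest = clause[0], clause[1:]

    def extend(cands):
        # Extension rule: try the next candidate contrapositive; alt1
        # backtracks into the remaining candidates, rem1 resumes the rest.
        for i, cla in enumerate(cands):
            if negate(lit) in cla:
                cla1 = [l for l in cla if l != negate(lit)]
                alt1 = lambda i=i: extend(cands[i + 1:])
                rem1 = lambda: prove(rest, path, lim, matrix, alt1, rem)
                return prove(cla1, path + [lit], lim, matrix, alt1, rem1)
        return alt()                     # no alternatives left: backtrack

    if negate(lit) in path:
        # Reduction rule; on failure, fall back to the extension alternatives.
        alt1 = (lambda: extend(matrix)) if len(path) < lim else alt
        return prove(rest, path, lim, matrix, alt1, rem)
    return extend(matrix) if len(path) < lim else alt()
```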
3.4.4 Stacks
The prove function has the same arguments as the prove function of the stream-based implementation, plus a stack. This stack contains tuples with information about clauses that still have to be processed, together with the depth at which the clauses were put onto the stack. Once the current clause has been completely refuted, the next tuple is popped from the stack and the clause in the tuple is processed.

3.5 Evaluation
- MPTP2078 [2] contains 2078 problems exported from the Mizar Mathematical Library. This dataset is particularly suited for symbolic machine learning since symbols are shared between problems. It comes in the two flavours “bushy” and “chainy”: in the “chainy” dataset, every problem contains all facts stated before the problem, whereas in the “bushy” dataset, every problem contains only the Mizar premises required to prove the problem.
- Miz40 contains the problems from the Mizar library for which at least one ATP proof has been found using one of the 14 combinations of provers and premise selection methods considered in [40]. The problems are translated to untyped first-order logic using the MPTP infrastructure [81]. Symbol names are also used consistently in this dataset, and the problems are minimised using ATP-based minimisation, i.e., re-running the ATP with only the set of proof-needed axioms until this set no longer shrinks. This typically leads to even better axiom pruning and ATP-easier problems than in the Mizar-based pruning used for the “bushy” version above.
- HOL Light: we translate theorems proven in HOL Light to first-order logic, following a procedure similar to [37]. We export top-level theorems (“top”) as well as theorems proven by the MESON tactic (“msn”). We consider the theorems proven in the core of HOL Light (“HL”) as well as those proven by the Flyspeck project (“FS”), which in 2014 completed a formal proof of the Kepler conjecture [27].
Dataset | TPTP | MPTP | Miz40 | HL-top | HL-msn | FS-top | FS-msn |
---|---|---|---|---|---|---|---|
Problems | 7492 | 2078 | 32,524 | 2498 | 1108 | 27,111 | 39,979 |
We ran E with --auto-schedule and Vampire with --mode casc. In addition, we evaluated the ATP Metis [33]: it implements the ordered paramodulation calculus (having inference rules for equality just like the superposition calculus), but is considerably smaller than Vampire and E and is implemented in a functional language, making it more comparable to our work.

Prover | TPTP | Bushy | Chainy | Miz40 | FS-top | FS-msn |
---|---|---|---|---|---|---|
Vampire | 4404 | 1253 | 656 | 30,341 | 6358 | 39,760 |
E | 3664 | 1167 | 287 | 26,003 | 7382 | 39,740 |
Metis | 1376 | 500 | 75 | 18,519 | 3537 | 38,625 |
fleanCoP\(+\)cut\(+\)conj | 1859 | 670 | 289 | 12,204 | 3980 | 35,738 |
fleanCoP\(+\)cut−conj | 1782 | 598 | 244 | 11,796 | 3520 | 30,668 |
fleanCoP−cut\(+\)conj | 1617 | 499 | 192 | 7826 | 3849 | 35,204 |
fleanCoP−cut−conj | 1534 | 514 | 164 | 11,115 | 3492 | 36,334 |
pleanCoP\(+\)cut\(+\)conj | 1673 | 606 | 182 | 11,243 | 3664 | 35,234 |
pleanCoP\(+\)cut−conj | 1621 | 548 | 153 | 11,227 | 3305 | 30,416 |
pleanCoP−cut\(+\)conj | 1428 | 453 | 143 | 7287 | 3671 | 34,437 |
pleanCoP−cut−conj | 1374 | 460 | 123 | 10,442 | 3415 | 35,499 |
fnanoCoP\(+\)cut | 1724 | 511 | 192 | 12,332 | 3178 | 30,327 |
fnanoCoP−cut | 1567 | 542 | 151 | 13,316 | 1993 | 37,938 |
pnanoCoP\(+\)cut | 1585 | 480 | 112 | 11,921 | 2970 | 30,272 |
pnanoCoP−cut | 1485 | 510 | 126 | 12,943 | 1986 | 38,015 |
Implementation | Solved | Inferences |
---|---|---|
Prolog | 606 | – |
Lazy list | 639 | 878,199,349 |
Stack (list substitution) | 648 | 1,253,862,954 |
Stream | 670 | 1,702,827,032 |
Continuation | 681 | 2,200,272,406 |
Stack | 681 | 2,490,100,879 |
Implementation | Solved | Inferences |
---|---|---|
Prolog | 480 | – |
Lazy list | 504 | 374,849,495 |
Streams | 511 | 495,368,962 |
4 Naive Bayesian Internal Guidance
4.1 Tableau Branch Characterisation
4.2 Naive Bayes
- We add a term to discriminate against features not present in \(\mathbf{f}\) that occurred in previous situations with the label \(l_i\).
- We weigh the probability of any feature f by its inverse document frequency \({{\,\mathrm{idf}\,}}(f)\) to give more weight to rare features.
- We drop the term \(\ln P(\mathbf{f})\), as we compare only values for fixed features \(\mathbf{f}\).
- We weigh the individual parts of the sum with constants \(\sigma _1\), \(\sigma _2\) and \(\sigma _3\).
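A hypothetical Python rendering of such a modified score (the smoothing constant, the default idf value, and the data layout are our assumptions, not the exact formula of Sect. 4.2) could look as follows:

```python
# Illustrative sketch of a modified Naive Bayes score with the adaptations
# listed above: idf weighting of present features, a penalty term for
# absent-but-expected features, and weights s1, s2, s3 for the parts of the
# sum. Smoothing (0.5 pseudo-count, 1e-9 floor) is an assumption.

from math import log

def nb_score(label_count, feat_count, idf, features, s1=1.0, s2=1.0, s3=1.0):
    # label_count: how often the label (contrapositive) was used before
    # feat_count: {feature: co-occurrences of the feature with the label}
    # idf: {feature: inverse document frequency}
    # features: features of the current branch
    score = s1 * log(label_count)
    for f in features:                        # present features, idf-weighted
        p = feat_count.get(f, 0.5) / label_count
        score += s2 * idf.get(f, 1.0) * log(max(p, 1e-9))
    for f, c in feat_count.items():           # discriminate against absent ones
        if f not in features:
            score += s3 * idf.get(f, 1.0) * log(max(1 - c / label_count, 1e-9))
    return score
```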
4.3 Implementations
4.4 Evaluation
Prover | Training | Testing | \(\sum \) | \(\bigcup \) |
---|---|---|---|---|
leanCoP−def | 574 | 0 | 574 | 664 (\(+\mathbf{15.7}\%\)) |
FEMaLeCoP−def | 550 (\(-\)4.2%) | 90 | 640 (\(+\mathbf{11.5}\%\)) | |
leanCoP\(+\)def | 568 | 0 | 568 | 623 (\(+9.7\%\)) |
FEMaLeCoP\(+\)def | 540 (\(-\)4.9%) | 55 | 595 (\(+4.8\%\)) |
Prover | Training | Testing | \(\sum \) | \(\bigcup \) |
---|---|---|---|---|
leanCoP−def | 643 | 0 | 643 | 701 (\(+\mathbf{9.0}\%\)) |
FEMaLeCoP−def | 607 (\(-\)5.6%) | 58 | 665 (\(+3.4\%\)) | |
leanCoP\(+\)def | 577 | 0 | 577 | 627 (\(+8.7\%\)) |
FEMaLeCoP\(+\)def | 542 (\(-\)6.1%) | 50 | 592 (\(+2.6\%\)) |
5 Monte Carlo Proof Search
5.1 Monte Carlo Tree Search
5.2 Child Selection Policy
5.3 Child Probability
- The baseline probability assigns equal probability to all children, i.e. \(P\left( n' \mid n\right) \propto 1\).
- The open branches probability steers proof search towards derivations with fewer open branches, by assigning to \(n'\) a probability inversely proportional to the number of open branches in \(n'\). Therefore, \(P(n' \mid n) \propto 1 / \left( 1 + |b_o(n')|\right) \), where \(b_o(n)\) returns the open branches in n.
- The Naive Bayes probability attributes to \(n'\) a probability depending on the calculus rule applied to obtain \(n'\) from n. In case the extension rule was not used, the node obtains a constant probability. If the extension rule was used, the formula \({{\,\mathrm{\textsc {nb}}\,}}\) introduced in Sect. 4.2 is used, requiring contrapositive statistics from previous proofs. However, as \({{\,\mathrm{\textsc {nb}}\,}}\) does not return probabilities, we use it to rank contrapositives by the number of contrapositives with larger values of \({{\,\mathrm{\textsc {nb}}\,}}\): $$\begin{aligned} {{\,\mathrm{\text {rank}}\,}}_{{{\,\mathrm{\textsc {nb}}\,}}}(n, c) = \left| \left\{ c' \mid n \xrightarrow {{{\,\mathrm{\text {ext}}\,}}(c')} n', {{\,\mathrm{\textsc {nb}}\,}}(c', \mathbf{f}(n)) \ge {{\,\mathrm{\textsc {nb}}\,}}(c, \mathbf{f}(n)) \right\} \right| , \end{aligned}$$ where \(\mathbf{f}(n)\) denotes the features of the derivation n. Then, we assign to nodes as probability the inverse of the Naive Bayes rank: $$\begin{aligned} P(n' \mid n) \propto {\left\{ \begin{array}{ll} 1 / {{\,\mathrm{\text {rank}}\,}}_{{{\,\mathrm{\textsc {nb}}\,}}}(n, c) &{} \quad \text {if }\quad n \xrightarrow {{{\,\mathrm{\text {ext}}\,}}(c)} n' \\ 1 &{} \quad \text {otherwise} \end{array}\right. } \end{aligned}$$
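Given hypothetical nb values for the extension steps from a node, the ranking and the resulting normalised probabilities can be computed as in the following sketch (the data layout is an assumption of this illustration):

```python
# Sketch of the rank-based probability: each contrapositive is ranked by how
# many alternatives score at least as high under nb, and receives a
# probability proportional to the inverse of its rank.

def nb_rank_probs(nb_values):
    # nb_values: {contrapositive: nb score} for all extension steps from n
    ranks = {c: sum(1 for v2 in nb_values.values() if v2 >= v)
             for c, v in nb_values.items()}
    weights = {c: 1.0 / r for c, r in ranks.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}
```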
5.4 Reward
- The branch ratio reward determines the reward to be the ratio of the number of closed branches and the total number of branches, i.e. \(\rho (n) = |b_c(n)| / |b(n)|\).
- The branch weight reward is based on the idea that many open branches with large literals are indicators of a bad proof attempt. Here, the size |l| of a literal is measured by the number of symbol occurrences in l. Furthermore, the closer to the derivation root a literal appears, the more characteristic we consider it to be for the derivation. Therefore, the reward is the average of the inverse size of the branch leaves, where every leaf is weighted with the normalised depth of its branch: $$\begin{aligned} \rho (n) = \frac{1}{|b_o(n)|} \sum _{b \in b_o(n)} \frac{{{\,\mathrm{\text {norm}}\,}}({{\,\mathrm{\text {depth}}\,}}(b))}{|{{\,\mathrm{\text {leaf}}\,}}(b)|} \end{aligned}$$
- The machine-learnt closability reward assumes that the success ratio of closing a branch in previous derivations can be used to estimate the probability that a branch can be closed in the current derivation. This needs the information about attempted branches in previous derivations, and which of these attempts were successful. We say that a literal l stemming from a clause c is attempted to be closed during proof search when l lies on some branch. The attempt is successful iff proof search manages to close all branches going through l. Given such data from previous proof searches, let p(l) and n(l) denote the number of attempts to close l that were successful and unsuccessful, respectively. We define the unclosability of a literal l as \(\frac{n(l)}{p(l) + n(l)}\). However, the less data we have about a literal, the less meaningful our statistics will be. To account for this, we introduce weighted unclosability: We assume that a literal that never appeared in previous proof searches is most likely closable, i.e. its weighted unclosability is 0. The more often a literal was attempted to be closed, the more its weighted unclosability should converge towards its (basic) unclosability. Therefore, we model the probability of l to be closable as $$\begin{aligned} P(l \, {{\,\mathrm{\text {closable}}\,}}) = 1 - {{\,\mathrm{\text {norm}}\,}}(p(l)+n(l)) \frac{n(l)}{p(l)+n(l)} \end{aligned}$$ Finally, the closability of a derivation is the mean closability of all leaves of open branches of the derivation, i.e. the final reward formula is $$\begin{aligned} \rho (n) = \sum _{b \in b_o(n)} \frac{P({{\,\mathrm{\text {leaf}}\,}}(b)\, {{\,\mathrm{\text {closable}}\,}})}{|b_o(n)|} \end{aligned}$$
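The branch weight and machine-learnt closability rewards translate directly into code. The following Python sketch (the data layouts and the concrete norm function are assumptions for this example, not the implementation's) mirrors the two formulas above:

```python
# Illustrative sketches of the branch weight and machine-learnt closability
# rewards. Branch data is passed in as plain tuples; norm maps counts or
# depths into [0, 1] and is an assumption of this example.

def branch_weight_reward(open_branches, norm):
    # open_branches: list of (leaf_size, depth) pairs, one per open branch
    if not open_branches:
        return 1.0                       # no open branches left
    return sum(norm(d) / size for size, d in open_branches) / len(open_branches)

def closable(p, n, norm):
    # p, n: successful / unsuccessful attempts to close this literal before
    if p + n == 0:
        return 1.0                       # never attempted: assumed closable
    return 1.0 - norm(p + n) * n / (p + n)

def closability_reward(leaves, norm):
    # leaves: (p, n) statistics for the leaf literal of each open branch
    if not leaves:
        return 1.0
    return sum(closable(p, n, norm) for p, n in leaves) / len(leaves)
```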
5.5 Expansion Policy
- The default expansion policy adds \(n_1\), i.e. the simulation root, to the MC tree.
- The minimal expansion policy picks \(n_e\) to be the smallest of the simulation nodes with respect to a given norm \(|\cdot |\), such that for all i, \(|n_e| \le |n_i|\). If multiple \(n_e\) are admissible, the one with the smallest index e is picked. We consider two norms on nodes: 1. The first norm measures the number of open branches. 2. The second norm measures the sum of depths of open branches.
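A sketch of the minimal expansion policy and the two norms (the node representation is an assumption of this illustration):

```python
# Pick the simulation node with the smallest norm, breaking ties by the
# smallest simulation index, as the minimal expansion policy prescribes.

def minimal_expansion(nodes, norm):
    # nodes: the simulation nodes n_1, ..., n_k in simulation order
    best = min(range(len(nodes)), key=lambda i: (norm(nodes[i]), i))
    return nodes[best]

# Norm 1: number of open branches; norm 2: sum of their depths.
# Here a node is modelled as {"open": [depth, ...]}, an assumption.
norm1 = lambda n: len(n["open"])
norm2 = lambda n: sum(n["open"])
```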
5.6 Implementation
initTree \(L\ { Path}\ \sigma \) creates an initial Monte Carlo tree for a connection proof search tree for the word \(\langle \{L\}, M, { Path} \rangle \) under the substitution \(\sigma \). Starting from this initial Monte Carlo tree T, mcps T constructs a (potentially infinite) lazy list of Monte Carlo iterations, with T as its head, where an iteration consists of a Monte Carlo tree and possibly a proof discovered during the simulation performed in the iteration. Of this list, we consider T and the following maxIterations elements: when maxIterations is set to 0, only T is considered and thus proof search behaves like leanCoP. When maxIterations is set to \(\infty \), the whole proof search is performed in the MCPS part. As MCPS is performed lazily, MCPS may be performed for fewer than maxIterations iterations when it discovers some proof contributing to the final closed derivation. Here, the lazy list characterisation introduced in Sect. 3.4.2 permits a very concise implementation as well as an easy integration of techniques such as restricted backtracking. As soon as all proofs discovered during MCPS have been considered (line 5), the tree T of the final Monte Carlo iteration last mc is obtained and the children of the root of T are sorted by decreasing average T-descendant reward \(\overline{\rho } _T\) (line 6). Finally, the last applied proof step of each child is processed like in the lazy list implementation (lines 7–11).
5.7 Parameter Tuning
maxIterations \(= \infty \). For any heuristic h not used in the base configuration, we replace the default heuristic with h and evaluate the resulting configuration. The results are shown in Table 7: the heuristics that most improve the base configuration are the machine-learnt closability reward and the minimal expansion policy 2.

Configuration | Iterations | Sim. steps | Discr. | Solved |
---|---|---|---|---|
---|---|---|---|---|
Base | 116.46 | 1389.82 | 1.37 | 332 |
Uniform probability | 949.62 | 17,539.59 | 1.31 | 237 |
NB probability | 528.39 | 8014.03 | 1.35 | 248 |
Random reward | 104.88 | 1167.98 | 1.19 | 364 |
Branch weight reward | 108.13 | 1268.88 | 1.12 | 334 |
ML closability reward | 108.52 | 1151.61 | 2.30 | 367 |
Default exp. pol. | 371.81 | 4793.58 | 1.38 | 328 |
Minimal exp. pol. 2 | 224.72 | 2769.12 | 1.40 | 348 |
maxIterations performs best between 20 and 40, see Fig. 5a: below 20, MCTS cannot provide any meaningful quality estimates, and above 40, the quality estimates do not significantly improve any more, while costing computational resources. The exploration constant \(C_p \approx 0.75\) gives best results, where the machine-learnt closability reward achieves a local optimum, see Fig. 5b: at such an optimum, exploration and exploitation complement each other best, therefore the existence of such an optimum is a sanity check for reward heuristics (which the branch ratio reward does not pass). The maximal simulation depth \(s_{\max } \approx 20\) seems to perform best, see Fig. 5c. Above this value, the number of solved problems decreases, since the number of actually performed simulation steps decreases, as shown in Fig. 5d. This might be explained by the fact that at higher simulation depths, the computational effort to calculate the set of possible steps increases, for example because the substitution contains more and larger elements.
5.8 Evaluation
Prover | Training | Testing | \(\sum \) | \(\bigcup \) |
---|---|---|---|---|
leanCoP/m | 563 | 0 | 563 | 653 (\(+\mathbf{16.0}\%\)) |
monteCoP | 511 (\(-\)9.2%) | 90 | 601 (\(+6.7\%\)) | |
leanCoP/F | 577 | 0 | 577 | 627 (\(+8.7\%\)) |
FEMaLeCoP\(+\)def | 542 (\(-\)6.1%) | 50 | 592 (\(+2.6\%\)) |