Neural Network Verification is a Programming Language Challenge
Written by:
Lucas C. Cordeiro, Matthew L. Daggitt, Julien Girard-Satabin, Omri Isac, Taylor T. Johnson, Guy Katz, Ekaterina Komendantskaya, Augustin Lemesle, Edoardo Manino, Artjoms Šinkarovs, Haoze Wu
Abstract
Neural network verification is a new and rapidly developing field of research. So far, the main priority has been establishing efficient verification algorithms and tools, while proper support from the programming language perspective has been considered secondary or unimportant. Yet, there is mounting evidence that insights from the programming language community may make a difference in the future development of this domain. In this paper, we formulate neural network verification challenges as programming language challenges and suggest possible future solutions.
1 Introduction
Traditionally, statistical machine learning has distinguished its methods from “algorithm-driven” programming: the consensus has been that machine learning is deployed when there is example input-output data but no general algorithm for computing outputs from inputs. Thus, neural networks are commonly seen as programs that emerge from data via training, without direct human guidance on how to perform the computation. This unfortunate dichotomy has led to a divide between programming language and machine learning research that is still awaiting resolution.
The first hint that this dichotomy is not as fundamental as was thought came from the machine learning community itself. The famous paper by Szegedy et al. [112] pointed out the “intriguing” problem that even the most accurate neural networks fail to satisfy the property of robustness, i.e. small perturbations of their inputs should result in small changes to their output. Szegedy’s key example concerned imperceptible perturbations of pixels in an image that can sway the neural network’s classification decisions. This lack of robustness can have safety and security implications: for example, an autonomous car’s vision unit may fail to recognise pedestrians on the road. For that reason, the problem attracted significant attention [25] but remains unresolved to this day. Partial solutions often deploy methods of adversarial training — i.e., training based on computing adversarial attacks — which augment the training set with the worst-case perturbations of the input data points with respect to the output loss of the neural network [78].
The robustness of neural networks actually yields a formal specification [26]. Given a neural network \(f: \mathbb {R}^m \rightarrow \mathbb {R}^n\), f is robust around \(\hat{x} \in \mathbb {R}^m\), if
$$\begin{aligned} \forall x \in \mathbb {R}^m: \Vert x - \hat{x}\Vert \le \epsilon \Rightarrow \Vert f(x) - f(\hat{x})\Vert \le \delta \end{aligned}$$ (1)
where \(\epsilon , \delta \in \mathbb {R}\) are small constants and \(\Vert .\Vert \) computes a vector distance. From the programming language perspective, robustness can be seen as a refinement type that refines the input and output types of f, cf. [77]. At the same time, robustness is an example of a desirable property that neural networks cannot learn from data alone: note the quantification over vectors x that do not belong to the data set. This challenges the classical dichotomy between algorithm-driven and data-driven programming, demonstrating the inevitability of property specification in both cases.
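To make the definition concrete, the following is a minimal Python sketch (our illustration of the paper's formalism, not part of any verifier) that attempts to falsify Eq. 1 by random sampling. It can find counterexamples to robustness but, unlike the verifiers discussed below, it can never prove the property, precisely because of the universal quantification over x.

```python
import numpy as np

def robust_at(f, x_hat, epsilon, delta, n_samples=10_000, rng=None):
    """Search for a counterexample to Eq. 1 at x_hat by random sampling.

    Returns x with ||x - x_hat||_inf <= epsilon and ||f(x) - f(x_hat)|| > delta,
    or None if no counterexample was found. Sampling can falsify robustness
    but can never prove it.
    """
    rng = rng or np.random.default_rng(0)
    y_hat = f(x_hat)
    for _ in range(n_samples):
        # Sample a perturbation inside the epsilon-ball (infinity norm here).
        x = x_hat + rng.uniform(-epsilon, epsilon, size=x_hat.shape)
        if np.linalg.norm(f(x) - y_hat) > delta:
            return x  # counterexample: robustness is violated at x_hat
    return None  # no counterexample found; the property remains unproven
```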
Fig. 1.
Schematic representation of the state of the art in training and verifying neural networks for properties. Solid lines denote methods widely accepted by the research communities, dashed lines mean “some experimental prototypes exist”, dotted arrows mean the connection is desired but not established.
Against this background, both the machine learning and verification communities have proposed several useful methods for training for, or respectively verifying, certain properties\(^1\). Fig. 1 depicts these two groups of methods as two parallel pipelines. At the top, we include all adversarial training methods [78] that were generalised to account for arbitrary optimisation objectives, given a property informally expressed in (a fragment of) first-order logic [46, 52]. At the bottom, we include the verification pipeline, which is supported by more than a dozen neural network verifiers, such as Marabou [75, 120], \(\alpha \beta \)-CROWN [119], and PyRAT [51], to name but a few. Unlike the machine learning approaches, it features a formal language for property specification, VNN-LIB. Furthermore, an annual competition, VNN-COMP, develops common standards for this domain [19, 20].
However, there are several fundamental problems that prevent these emerging ideas from coming to full fruition. Firstly, both the machine learning and verification communities assume that, in theory, a neural network can be optimised for the desirable verification property. However, without any programming language support to ensure this formally, discrepancies between machine learning objectives and verification objectives have been found in the literature, even for simple robustness properties [26]. In Fig. 1, this problem is depicted by distinguishing the two versions, NN Property and NN Property\(^*\), with a dotted line between them. The desirable solution is to have a single language with the relevant specification, which is then compiled down to either verification or machine learning backends.
Similarly, discrepancies have been reported between different representations of neural networks [71], e.g., using real numbers in verification and floating-point numbers in training. In Fig. 1, this problem is depicted by showing two potentially disagreeing implementations, Implementation and Implementation\(^*\). Ideally, we should be able to verify the actual programs, not their idealised descriptions. An equally acceptable alternative is to reverse the solid arrow between the two implementations in Fig. 1 – that is, to ensure that the guarantees concerning the verified neural networks extend to their actual implementations, thus establishing the connection along the bottom dotted arrow in Fig. 1.
Finally, neural networks are rarely implemented as stand-alone programs. More often, they are embedded into larger systems that, in turn, may have their own specification and verification regimes. Although the idea of a verified neural network controller is not itself new to cyber-physical system research (cf. § 3.7), programming language support for the verification of such systems is a nascent field [89, 115].
In this light, we believe it is time to discuss how the verification and synthesis of safe neural networks fit together with general programming practices. In this “Fresh Perspectives” paper, we give an overview of the current state of the art in implementing neural network verification and explain the challenges the neural network verification community currently faces (Sec. 3). We do so by tracing different parts of the diagram in Fig. 1, and explaining the nature of the discrepancies in its different parts, from the programming language point of view. We wrap up this paper by suggesting possible ways the programming language community can help improve the state of the art (Sec. 4).
2 Neural Network Verification Properties
The problem of defining verification properties for neural networks has received substantial attention. Verification approaches started with neural networks deployed as controllers in autonomous systems [74, 102]. With time, they were generalised to cover data-dependent verification properties such as robustness [42, 43, 64, 122]. A set of standard benchmarks is revised and updated annually at the VNN-COMP; the competition reports [18, 21] provide a thorough overview of them.
Neural network verification properties can be divided into three categories.
1.
Geometric properties. These properties are based on the geometry of the data manifold without any appeal to its possible semantic meaning. One such property is (local) robustness, whose definition is given in Equation 1 (see also the additional examples in [27]). Another related property is (local) equivalence [114], which constrains the output of two different networks to be similar under the same input, either in absolute value (\(\epsilon \)-equivalence) or class prediction (top-k equivalence).
2.
Hyper-properties. These properties require guarantees for any input, rather than just those close to the data manifold. Classic examples are global robustness [101] and global equivalence [82]. A more recent example of such properties is confidence-based robustness [8], which allows for some non-robust behaviour, but only for inputs close to the decision boundary. The latter complicates the specification and verification process in interesting ways (see Sec. 4).
3.
Domain-specific properties. These properties are based on the presumed semantics of the data on which the neural network is trained. Usually, they take the form of admissible intervals on the input and output vector values.
The ACAS Xu challenge (the oldest neural network verification benchmark) best illustrates this third class of properties. It takes a neural network that models an aircraft controller: based on five input measurements between the own ship and an intruder (distance, angles, relative speeds), the neural network outputs one of five advisory actions (strong/weak left or right, clear of conflict).
When the benchmark was introduced in [74], nine properties were formulated by the engineers who designed the collision avoidance software. For instance, Property 3 states that if the intruder is directly ahead and is moving towards the own ship, the network will not advise clear of conflict. When written in the VNN-LIB query language [35] (see Sec. 3.1), the property is translated to real-valued intervals on the five input measurements and a constraint on the output prediction.
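To illustrate the shape of such a property, the following Python sketch encodes a Property-3-style query as a box over the five inputs plus a constraint on the five output scores. The numerical bounds below are illustrative placeholders; the exact values are given in [74].

```python
import numpy as np

# Property-3-style query: a box precondition over the five inputs
# (rho, theta, psi, v_own, v_int) and a postcondition on the five output
# scores. The bounds below are illustrative placeholders; see [74] for
# the exact values. In ACAS Xu, the advised action has the *minimal* score,
# so "never advise clear of conflict" means output 0 is never minimal.
property_3 = {
    "input_lower": np.array([1500.0, -0.06, 3.10, 980.0, 960.0]),
    "input_upper": np.array([1800.0, 0.06, 3.14, 1200.0, 1200.0]),
    "holds_on_output": lambda y: int(np.argmin(y)) != 0,
}
```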
3 Neural Network Verification: State of the Art
In this section, we describe the state of the art in neural network verification from the perspective of existing programming language support, rather than existing verification algorithms; for the latter, we refer the reader to the tutorial [4]. We proceed by tracing the different arrows of Fig. 1 and explaining the existing discrepancies and solutions.
3.1 Verification pipeline
Fig. 2.
Schematic representation of the neural network verification pipeline.
Neural Network Verification Problem. Let us start by describing the common verification pipeline illustrated in Fig. 2. Given a trained neural network \(f :\mathbb {R}^m \rightarrow \mathbb {R}^n\) and some network property \(\varXi \), the Neural Network Verification Problem is the problem of deciding whether \(\varXi (f)\) holds. Current verifiers assume a special format — ONNX (Open Neural Network Exchange) [1] — to represent the neural networks. Thus, in reality, we verify \(\varXi (f^*)\), where \(f^*\) is obtained from f by ONNX translation.
The verifiers typically consider properties defined by a precondition on the network inputs and a postcondition on its outputs. Both conditions are most commonly linear (e.g., defined using linear bounds) and represent safe regions. Formally, let \(\varXi := \langle P, Q \rangle \), where \(P:\mathbb {R}^m\rightarrow \{\top ,\bot \}\) and \(Q:\mathbb {R}^n\rightarrow \{\top ,\bot \}\). The neural network verification problem is then deciding whether \(\forall x\in \mathbb {R}^m: P(x)\Rightarrow Q(f^*(x))\). Neural network verification algorithms then attempt to find a counterexample (i.e., \(x\in \mathbb {R}^m\) such that \(P(x)\wedge \lnot Q(f^*(x))\)) or conclude there is none. Several neural network verifiers are currently available to solve such verification problems: e.g. Marabou [75, 120], \(\alpha \beta \)-CROWN [119], PyRAT [51], NNV [85, 117] and ERAN [110]. Since 2020, an annual International Verification of Neural Networks Competition (VNN-COMP) has been held; it has played an important role in consolidating the new research community and developing standards for this domain [19, 20].
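To give a flavour of how a verifier can "conclude there is none", the following is a minimal Python sketch of naive interval bound propagation for ReLU networks: it soundly over-approximates the reachable outputs for a box-shaped precondition P and checks the postcondition Q on the result. Real verifiers such as Marabou or \(\alpha \beta \)-CROWN use far tighter relaxations and case splitting; this sketch is sound but highly incomplete.

```python
import numpy as np

def affine_bounds(W, b, lo, hi):
    """Propagate interval bounds through x -> Wx + b (interval arithmetic)."""
    W_pos, W_neg = np.maximum(W, 0.0), np.minimum(W, 0.0)
    return W_pos @ lo + W_neg @ hi + b, W_pos @ hi + W_neg @ lo + b

def relu_bounds(lo, hi):
    return np.maximum(lo, 0.0), np.maximum(hi, 0.0)

def verify_box(layers, lo, hi, check_output):
    """Over-approximate the outputs of a ReLU network on the input box
    [lo, hi] (the precondition P) and check the postcondition Q on the
    result. True means Q provably holds; False only means "unknown"."""
    for W, b in layers[:-1]:
        lo, hi = relu_bounds(*affine_bounds(W, b, lo, hi))
    lo, hi = affine_bounds(*layers[-1], lo, hi)
    return check_output(lo, hi)

# Example: a 2-2-1 ReLU network and Q := "the output stays below 10".
layers = [(np.array([[1.0, -1.0], [0.5, 1.0]]), np.array([0.0, -0.5])),
          (np.array([[1.0, 1.0]]), np.array([0.0]))]
print(verify_box(layers, np.array([-1.0, -1.0]), np.array([1.0, 1.0]),
                 lambda lo, hi: bool(np.all(hi < 10.0))))  # True
```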
Mainstream specification languages. Most neural network verifiers have a basic query language for representing individual queries. These formats are invariably simple enough that the type system is implicit rather than explicit, and they possess no capability to abstract over definitions. The de-facto standard is the VNN-LIB query language [35], which is used in VNN-COMP [11]. The language is a subset of the QF_LRA fragment of the SMT-LIB language, an S-expression based language widely used in the SMT verification community as a standard input for SMT provers [13]. The goal of VNN-LIB is to model first-order logic properties of the inputs and outputs of neural networks. Fig. 3 illustrates a snippet of a robustness specification written in VNN-LIB. As can be seen, the VNN-LIB specification itself does not explicitly talk about the functions f or \(f^*\); rather, it assumes that the property will be used to verify the function \(f^*\) provided in a separate ONNX file. Thus, VNN-LIB and ONNX together serve as a specification for \(\varXi (f^*)\).
Fig. 3.
Snippet of a robustness specification in VNN-LIB for an image data set with input of dimension 792 and 10 classes. The specification assumes an external definition of \(f^*: \mathbb {R}^{792} \rightarrow \mathbb {R}^{10}\).
From a programming language perspective, there are several issues with the VNN-LIB format as a language for expressing specifications.
1.
Lack of expressivity. VNN-LIB and ONNX are simply not expressive enough to represent all the specifications users want to write. For example, the VNN-LIB and ONNX formats can only refer to a single neural network at a time, which makes it impossible to encode specifications that express properties of several neural networks at once. Similarly, hyperproperties [8, 28] cannot be specified in VNN-LIB without special tooling, and neither can properties involving hidden neurons. Finally, VNN-LIB only supports satisfaction queries, meaning that the specification writer must manually negate universal queries before they can be encoded.
2.
Lack of conciseness. The lack of abstraction and the limitation that variables cannot represent multi-dimensional tensors mean that more complex properties cannot be represented concisely. Consequently, the length of the queries tends to scale with the dimensions of the inputs and outputs of the network, even when the property can be expressed concisely in mathematics in constant space. For example, the full specification in Fig. 3 that encodes the single line of Eq. 1 is a couple of thousand lines long (see the sketch after this list).
3.
Lack of rigour. VNN-LIB does not have a formally defined semantics, nor does it even formally define its own syntax. Consequently, it is difficult for users to check whether their specification in VNN-LIB is correct or compliant, and impossible to prove the soundness of tools that either consume or generate VNN-LIB. Furthermore, the ONNX format that VNN-LIB relies on also lacks a formal semantics. For example, the ONNX documentation for the convolution operator\(^2\) has no proper mathematical specification of the semantics of the operator, describing it only with the single sentence “The convolution operator consumes an input tensor and a filter, and computes the output”. Other ONNX operator descriptions, like those of Convolution, Maxpool, or Add (for broadcasting), refer to external sources like Numpy, PyTorch or Tensorflow for further implementation details.
4.
Lack of dynamic bindings to datasets. Crucial to most attempts to specify “correctness” of a neural network is the notion of the data manifold, i.e., the distribution of inputs that the neural network will actually encounter during operation. Usually, the data manifold is only a small subset of the actual input space. By definition, the network should never encounter inputs that lie off the data manifold during normal operation. If it does, there is no reason to require any particular behaviour from the network, and consequently, specifications should only quantify over inputs that lie on the manifold. The problem is that, in most cases, there is no precise mathematical definition of the data manifold. Therefore, the most common approach is for the specification to approximate the manifold as the union of “small” regions around each input in the training dataset. Unfortunately, the training datasets themselves are frequently huge, containing anywhere from thousands to hundreds of millions of items. Therefore, it is infeasible to express the dataset directly in the specification.
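To make the conciseness problem of issue 2 tangible, here is a hypothetical Python generator for a Fig. 3-style VNN-LIB robustness query. The variable names X_i/Y_j and the negated-postcondition encoding follow common VNN-COMP practice, but the helper itself is our illustration, not part of any tool. The emitted query grows linearly with the input dimension, while Eq. 1 stays one line.

```python
def robustness_query(x_hat, eps, true_label, n_classes):
    """Emit a VNN-LIB-style robustness query for a single input x_hat.

    Two assertions per input dimension, plus one disjunctive assertion over
    the outputs (the *negated* postcondition, since VNN-LIB only supports
    satisfaction queries): the query grows linearly with the input dimension.
    """
    lines = [f"(declare-const X_{i} Real)" for i in range(len(x_hat))]
    lines += [f"(declare-const Y_{j} Real)" for j in range(n_classes)]
    for i, v in enumerate(x_hat):
        lines.append(f"(assert (>= X_{i} {v - eps}))")
        lines.append(f"(assert (<= X_{i} {v + eps}))")
    # Counterexample condition: some wrong class scores at least as high.
    worse = " ".join(f"(>= Y_{j} Y_{true_label})"
                     for j in range(n_classes) if j != true_label)
    lines.append(f"(assert (or {worse}))")
    return "\n".join(lines)

# For the 792-dimensional data set of Fig. 3, the query is already
# a couple of thousand lines long, as noted in issue 2:
print(len(robustness_query([0.5] * 792, 0.01, 3, 10).splitlines()))  # 2387
```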
This lack of rigour in the underlying specification formats has been recognised as a major problem. A recent effort in the ONNX community has led to the creation of an ONNX Safety-Related Profile working group\(^3\), which aims to elaborate a dedicated ONNX profile for safety-related systems. While still embryonic, this working group might address some of the issues highlighted above.
To work around the remaining problems, the natural solution is to allow users to represent their specifications in a higher-level specification language, connecting the neural network specification to the language of the larger system in which it is embedded. Moreover, the specification language must provide some mechanism for dynamically binding variables to existing datasets in standard formats used by machine learning practitioners.
3.2 Prototypes of New Specification Languages
Fig. 4.
An extract from a local robustness specification in the input languages of CAISAR and Vehicle, for the same image dataset described in Fig. 3. Note the ability to reuse predicates and definitions, the conciseness of vector-based operations, and the explicit data set bindings.
In response to the outlined problems, two major attempts have been made to design more principled specification languages for neural network verification. We outline the essence of both, in turn. Fig. 4 provides code snippets for illustration.
1.
CAISAR. The CAISAR platform [51] incorporates a higher-level specification language derived from WhyML [45]. WhyML is a typed first-order language with pattern-matching, polymorphism, and a module system. On top of that, CAISAR provides additional types for the linear algebra structures common in machine learning and compiles the specification back to plain WhyML. Writing a compiler from WhyML to VNN-LIB is straightforward, allowing CAISAR to target all state-of-the-art solvers from one single specification. It can also deal with specifications involving multiple neural networks and dynamically bind variables to concrete datasets. However, it can be argued that the composability of WhyML is limited, and the lack of dependent types prevents the modelling of important properties (for instance, encoding the dimension of inputs directly in their types could prevent common runtime errors).
2.
Vehicle. The Vehicle specification language [31‐33] is a higher-order, dependently-typed functional language. The language aims to express a full range of specifications; to that end, it contains quantifiers as first-class language constructs, conditionals, and higher-order functions over tensors such as maps and folds. The language's dependently-typed nature allows the user to encode richer properties, including tensor size constraints that can be checked before verification by the type-checker. Vehicle also has a backend that allows connecting proofs of neural network properties to larger system specifications in Agda [32]. However, unlike CAISAR, it connects to far fewer tools and cannot allow multiple solvers to work together.
Both languages address the problems outlined in Sec. 3.1 and provide concrete implementations. Note, in particular, that both manage data set bindings, neural network bindings, and data validity checks in clear, explicit ways. By doing this, they essentially build specification languages on top of the existing pipelines: in Figs. 1 and 2, this is depicted by a dashed “Specification” box towards the left side. Other specification languages exist, like NeSAL [123] (which has no implementation) or DNNP [107] (which lacks quantifiers and strong typing).
3.3 The Embedding Gap
We now consider the influence of larger system verification on the neural network verification pipeline (see Fig. 5). Consider a purely symbolic program \(s(\cdot )\), whose completion requires computing a complex, unknown function \(\mathcal {H}: \mathcal {P}\rightarrow \mathcal {R}\) that maps objects in the problem input space \(\mathcal {P}\) to those in the problem output space \(\mathcal {R}\). Given an embedding function \(e : \mathcal {P}\rightarrow \mathbb {R}^m\) and an unembedding function \(u : \mathbb {R}^n \rightarrow \mathcal {R}\), we can approximate \(\mathcal {H}\) by training a neural network \(f : \mathbb {R}^m \rightarrow \mathbb {R}^n\) such that \(u \circ f \circ e \approx \mathcal {H}\). We refer to \(u \circ f \circ e\) as the solution, and to \(\mathbb {R}^m\) and \(\mathbb {R}^n\) as the embedding input space and embedding output space respectively. Unlike objects in the problem space, the vectors in the embedding space are often not directly interpretable. The complete program is then modelled as \(s(u \circ f \circ e)\). Examples of u and e would be the normalization of inputs, resizing operations for images, or the data augmentation operations that are commonplace in machine learning pipelines.
Our end goal is to prove that \(s(u \circ f \circ e)\) satisfies a property \(\varPsi \), which we will refer to as the program property. The natural way to proceed is to establish a solution property \(\varPhi \) and a network property \(\varXi \) such that the proof of \(\varPsi \) decomposes into the following three lemmas:
$$\begin{aligned} \varXi (f) \end{aligned}$$ (2)
$$\begin{aligned} \forall g : \varXi (g) \Rightarrow \varPhi (u \circ g \circ e) \end{aligned}$$ (3)
$$\begin{aligned} \forall h : \varPhi (h) \Rightarrow \varPsi (s(h)) \end{aligned}$$ (4)
i.e. Lemma 2 proves that the network f obeys the network property \(\varXi \), then Lemma 3 proves that this implies \(u \circ f \circ e\), the neural network lifted to the problem space, obeys the solution property \(\varPhi \), and finally Lemma 4 proves that this implies \(s(u \circ f \circ e)\), the neuro-symbolic program, obeys the program property \(\varPsi \).
Fig. 5.
Schematic representation of the embedding gap.
The first issue that we run into is what we call the embedding gap. In \(\varPsi \), users would like to be able to model data that potentially has non-trivial semantics (for example, featuring both continuous and discrete parameters of a cyber-physical system, such as velocity, stopping distance, switches etc.). However, in \(\varXi \), all values must be represented as continuous real vectors (in actuality, at the training phase, floating-point vectors, cf. Sec. 3.4). Any embedding of the former into the latter is necessarily highly non-surjective.
For example, consider an input type with two values, ‘Yes’ and ‘No’, encoded as the real values ‘0.0’ and ‘1.0’ respectively. In the low-level query, one can encode the fact that this input variable may only take two possible values using a disjunctive constraint (\(x = 0.0 \vee x = 1.0\)), but this does not scale well as the number of constructors in the data type grows, as each disjunction drastically increases the cost of verification. Instead, the most common current solution is to encode this as a single non-disjunctive constraint, \(0.0 \le x \le 1.0\). The problem is that this interval contains many other values (e.g., ‘0.005’, ‘0.97’) which are meaningless in the chosen domain.
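The following Python sketch makes the embedding and unembedding functions of this section concrete for the hypothetical ‘Yes’/‘No’ example (the names e, u and solution mirror the notation above; the network f is left abstract). Note that solution is only defined on the problem space, whereas f itself accepts every vector in \(\mathbb {R}^1\).

```python
import numpy as np

def e(answer: str) -> np.ndarray:
    """Embed a problem-space value into the embedding input space R^1."""
    return np.array([0.0]) if answer == "Yes" else np.array([1.0])

def u(logits: np.ndarray) -> str:
    """Unembed a network output in R^2 into the problem output space."""
    return ["accept", "reject"][int(np.argmax(logits))]

def solution(f, answer: str) -> str:
    """The solution u . f . e. It is only defined on 'Yes'/'No', whereas
    f itself accepts every vector in R^1, including the meaningless
    interior points such as 0.005 or 0.97 discussed above."""
    assert answer in ("Yes", "No")
    return u(f(e(answer)))
```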
Fig. 6.
Outline of Vehicle compiler backends, bridging the Embedding Gap [32, 33]. Dashed lines indicate information flow and solid lines automatic compilation.
More generally, if users are to express specifications at the level of \(\varPsi \), the high-level specification language must also allow them to specify the embedding and unembedding functions, e and u, as part of the specification. It should then be the responsibility of the compiler to generate suitable low-level queries representing \(\varXi \). In other words, allowing the user to encode their specifications at the high level \(\varPhi \) requires a specification language compiler that can automatically translate from the former to the latter. The only existing attempt to provide programming language support for this was made by Vehicle [31‐33], as shown in Fig. 6. In particular, Vehicle provides a specification language to express \(\varPhi \), e, and u, and can compile the specification to Agda, in which more general properties of \(s(\cdot )\) can be defined.
3.4 The Implementation Gap
In Sec. 3.1, we considered \(\varXi (f^*)\), where \(f^*\) was an ONNX object, possibly obtained by conversion from the original implementation of f. The ONNX format has no backward translation from \(f^*\) to f, as the diagram in Fig. 7 shows. Nevertheless, the majority of neural network verification publications implicitly assume that the obtained verification guarantees about \(f^*\) extend to f. In this section, we outline a range of problems this assumption causes, and thus trace the right-most section of the diagram illustrated in Figs. 1 and 7.
Poor support for neural architecture conversion to ONNX. The ONNX re-implementation of original neural networks remains a largely manual and unverified procedure, which may be a source of errors. For example, neural networks contain different types of linear (e.g., fully connected, convolutional) and non-linear (e.g., ReLU, sigmoid, MaxPool) connections. Supporting the formal analysis of a new type of connection typically requires tool developers to add a new dedicated module to the codebase. For example, in verifiers based on abstract interpretation [110], this process involves implementing the abstract transformer for the new type of connection. In SMT-based verification procedures [74, 121], the developer would need to implement the encoding, simplification, and satisfiability checking of the constraints corresponding to the new connection. This process is tedious, repetitive, and error-prone. For example, the verification code for two-phase activation functions such as ReLU, Leaky ReLU, and absolute value is very similar, yet developers typically need to hard-code separate verification modules for each of these connections. Ideally, there should be automated conversion procedures with correctness guarantees.
Fig. 7.
Schematic representation of the implementation gap.
Mismatch in numerical types. Barring experimental architectures that rely on analog computing [124], most implementations of neural networks are based on digital platforms that operate with finite-precision types such as integer and floating-point numbers. Effective conversion between real-valued types and finite-precision ones is an active research direction in machine learning [49].
The most ubiquitous numerical type in machine learning is the floating-point number [14, 81]. Indeed, the IEEE 754 single-precision (32-bit) floating-point type [65] is the de facto standard of libraries such as Tensorflow\(^4\) and Pytorch\(^5\). Efforts to improve on the IEEE 754 standard exist, but they are often relegated to the context of hardware accelerators, where reducing the bit-size of numerical types may yield significant gains in terms of speed, memory and power consumption [23, 118].
From the verification perspective, it is well known that the safety certificates produced by real-valued neural network verifiers do not hold for floating-point implementations [71, 129]. Indeed, Jia and Rinard [71] propose an algorithm to search for floating-point counterexamples to real-valued safety certificates, thus invalidating them. Similarly, Zombori et al. [129] construct neural networks that contain backdoors which remain undetectable as long as the effects of numerical precision are neglected. Furthermore, the counterexamples produced by real-valued verifiers may not exist in a floating-point implementation of the same neural network, a phenomenon that has been reported on some VNN-COMP benchmarks [91].
Other sources of non-determinism. The current machine learning workflow, from training to inference, is not reproducible across different hardware and software platforms [29, 100, 105, 128]. This is due to a variety of reasons:
1.
Non-associativity of floating-point. It is well known that floating-point operations are not associative, i.e., \(a+(b+c)\ne (a+b)+c\) in general. As such, we can only verify the behaviour of a floating-point neural network if we know the order of all its operations (see the sketch after this list). The existing de-facto standard, ONNX, does not include such a level of detail.
2.
Parallel execution. Inference and training of neural networks are often sped up via parallel execution. Whether this is done via SIMD operations, multi-core CPUs, or GPU parallelism, it always introduces non-determinism in the results [100, 105].
3.
Auto-selection of primitives. Modern machine learning compilers like XLA\(^6\) automatically select the most efficient algorithms depending on the computational load [100]. While PyTorch and Keras provide ways to fix the choice of algorithm, the ONNX runtime does not. For instance, Schögl et al. [105] report non-deterministic behaviour in the selection of convolution algorithms on GPUs, which may alternate between explicit-loop, GEMM-based, Winograd and FFT implementations.
4.
Runtime optimisations. Machine learning frameworks may also implement runtime optimisations that modify the structure of the model itself to speed up inference or reduce memory usage, for example by fusing layers together (e.g. convolution and batch normalisation).
5.
Non-deterministic training. The learning process itself is highly non-deterministic. Common sources include: parameter initialisation, data augmentation strategy, batch ordering, and dropout layers [100].
6.
Mathematical library rounding. A long-standing issue in floating-point computation is incorrect roundings in the standard mathematical library math.h. Technically, the IEEE 754 standard recommends correct roundings [65], and there are efforts to create open-source implementations of math.h that abide by it [108]. However, mainstream compilers instead implement a variety of approximately-rounded algorithms [17].
7.
Low-level implementation details. Furthermore, derived operators such as Softmax may leverage the fact that \(\text {softmax}(x + c) = \text {softmax}(x)\) with constant c to increase the precision and avoid overflows. Such details can only be found in the low-level source code, even though they severely affect the precision of the computation.
Overall, the end-to-end effects of the above sources of non-determinism cannot be neglected. Indeed, Pham et al. [100] report a \(2.9\%\) difference in accuracy when reproducing the same training run on different platforms. Similarly, Cidon et al. [29] report a \(6\%\) difference in accuracy when considering the whole image recognition pipeline, including camera noise and image processing algorithms.
From the verification perspective, certifying the safety of neural network implementations requires a different approach than high-level neural network verifiers like Marabou [120] or \(\alpha \beta \)-CROWN [119]. Indeed, if we had access to the low-level implementation of every library in the machine learning pipeline, we could employ software verifiers [15] for this purpose. Unfortunately, existing software verifiers struggle to cope with the scale and complexity of neural network code [88, 91, 92]. In contrast, automated testing approaches are currently more effective [36, 56, 98], but cannot prove correctness.
Quantised neural networks. Switching to integer types (uniform quantisation) [49] can help alleviate some of the above problems (e.g. non-associativity of floating point, incorrect rounding) and improve reproducibility. From the machine learning perspective, a good quantisation scheme maintains the accuracy of the original floating-point neural network. Usually, 8-bit integers are used, but more aggressive quantisation schemes exist, down to ternary [58] and binary representations [103].
From the verification perspective, integer and binary data types require fundamentally different representations from the real-valued types used by mainstream verifiers such as Marabou [120] and \(\alpha \beta \)-CROWN [119]. Existing work on verifying quantised neural networks relies on either the bit-vector SMT theory [12, 50, 59] or (mixed) integer linear programming (ILP, MILP) [63, 82, 94, 126]. In contrast, verifying the robustness of some binarised neural network architectures can be encoded as a Boolean satisfiability (SAT) problem [70, 97]. Other binarised architectures can still be encoded as real-valued verification instances [6].
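For concreteness, here is a minimal numpy sketch of the uniform affine quantisation scheme discussed above [49]. The function names and the simple min/max calibration are our simplifications; production schemes calibrate scales per channel and handle saturation more carefully.

```python
import numpy as np

def quantise(x, n_bits=8):
    """Uniform affine quantisation of a float tensor to n-bit integers."""
    qmin, qmax = 0, 2 ** n_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax)
    return q.astype(np.uint8), scale, zero_point

def dequantise(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

x = np.random.randn(4).astype(np.float32)
q, s, z = quantise(x)
print(x)
print(dequantise(q, s, z))  # close to x, but not equal: quantisation error
```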
3.5 Reliable Proof Production and Checking
To overcome some of the challenges raised above, neural network verifiers may accompany their results with proof certificates, attesting to their soundness via an external and relatively simple checking program (the proof checker) [66]. Since neural network verifiers are complex software, optimised for performance and speed, verifying them directly is commonly intractable. Thus, proof production replaces the need to verify the neural network verifiers themselves with the need to verify only the proof checker. When a safety property is violated, neural network verifiers often accompany their results with a counterexample, which can then be checked by evaluating it on the network. However, proving safety (i.e., the absence of a violation of the property) is not as straightforward, as the DNN verification problem is NP-complete even for simple networks [74, 104]. Therefore, proving safety is a greater challenge than proving a violation, and thus requires a more complicated proof and, consequently, a more complicated proof checker.
A proof production mechanism supporting several piecewise-linear activation functions was implemented on top of Marabou [66, 120]. The proofs produced by Marabou are checked by a proof checker implemented within Marabou itself, written in C++ and using floating-point arithmetic for its computations.
When using an external proof checker, the reliability of the neural network verifier depends on the reliability of the proof checker. Therefore, the proof checker is expected to meet higher standards of reliability, ideally provable soundness. Functional programming languages allow the implementation of a precise checker and the formal verification of its soundness. For example, the simply-typed language Imandra was deployed to check proofs produced by Marabou [37, 38]. This work also shows that computation with precise real arithmetic comes at the price of limited performance. This opens up the possibility of a variety of implementations of the same checking algorithm in different programming languages, exploring the trade-off between precision and performance.
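The following is a minimal Python sketch of the "precise checker" idea, assuming a toy ReLU network with rational weights: evaluating over fractions.Fraction makes a counterexample check \(P(x)\wedge \lnot Q(f(x))\) exact, but every arithmetic operation now manipulates arbitrary-precision rationals, illustrating the performance cost observed for the Imandra checker [37, 38].

```python
from fractions import Fraction

def relu_net(layers, x):
    """Evaluate a ReLU network exactly; layers are (W, b) with Fraction
    entries, x is a list of Fractions."""
    for k, (W, b) in enumerate(layers):
        y = [sum(w * xj for w, xj in zip(row, x)) + bi
             for row, bi in zip(W, b)]
        # ReLU on every layer except the last one.
        x = y if k == len(layers) - 1 else [max(v, Fraction(0)) for v in y]
    return x

W1 = [[Fraction(1), Fraction(-1)], [Fraction(1, 2), Fraction(1)]]
b1 = [Fraction(0), Fraction(-1, 2)]
W2 = [[Fraction(1), Fraction(1)]]
b2 = [Fraction(0)]

# Exact evaluation of a candidate counterexample: no rounding anywhere.
out = relu_net([(W1, b1), (W2, b2)], [Fraction(3, 10), Fraction(1, 5)])
print(out)  # [Fraction(1, 10)]
```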
3.6 Property-Guided Training
Fig. 8.
Schematic representation of the neural network training pipeline.
Finally, we give a brief outline of the state of the art in property-guided training, which occupies the upper section of Fig. 8. This is a booming area in its own right, also known under the umbrella term of neuro-symbolic AI. By pointing out existing programming language discrepancies and solutions, we do not attempt to give a full survey of neuro-symbolic AI; instead, we refer the reader to more comprehensive surveys [53, 60].
In the introduction, we already outlined the evolution from adversarial training (seen as training specifically for the robustness property) to more general property-driven training (for any property of choice) [46, 47, 111]. It is noteworthy that, although robust training by projected gradient descent [54, 79, 87] predates verification, contemporary approaches are often related to, or derived from, the corresponding verification methods, in that they optimize verification-inspired regularization terms.
The weakest form of property-based training boils down to translating a specification, written in a subset of first-order logic, into a loss function that serves directly as an optimisation objective within the implementation of a training algorithm. The training algorithm thus optimises the neural network to satisfy the desired property. This translation method is known under the name of differentiable logic (DL) [46, 47, 111]. Vehicle implements DL as one of its backends [32] (cf. Fig. 6) and serves as a prototype of a compiler for neural network property specification languages (cf. Fig. 1). Recently, this has inspired attempts at formalising different DLs in Coq [2].
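As a minimal sketch of the idea (loosely in the style of DL2 [46]; the function names and the sampling strategy are our assumptions), the robustness property of Eq. 1 can be translated into a loss that is zero when the property holds on the sampled points and positive otherwise:

```python
import numpy as np

def leq_loss(a, b):
    """DL-style translation of the atom  a <= b  into a penalty."""
    return max(a - b, 0.0)

def robustness_loss(f, x_hat, epsilon, delta, n_samples=32, rng=None):
    """Loss-function translation of Eq. 1: zero iff the property holds on
    the sampled points. Sampling replaces the universal quantifier, so this
    guides training but proves nothing."""
    rng = rng or np.random.default_rng(0)
    y_hat = f(x_hat)
    loss = 0.0
    for _ in range(n_samples):
        # Sample inside the epsilon-ball, i.e. where the premise holds.
        x = x_hat + rng.uniform(-epsilon, epsilon, size=x_hat.shape)
        # The conjunction over samples is translated as a sum of atom losses.
        loss += leq_loss(float(np.linalg.norm(f(x) - y_hat)), delta)
    return loss / n_samples
```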
There are other forms of training for robustness that come with stronger guarantees than DLs, e.g. IBP training [55, 125] and certified training [96, 127]. However, these usually have limited capacity for property specification; an investigation of how these methods may fit into larger verification pipelines is warranted.
3.7 Other Directions
Verification of Cyber-Physical Systems. When following the diagram of Fig. 1, we did not impose any assumptions on the nature of properties we wish to ensure. In particular, we did not specify whether the “System” needs to be a cyber-physical system (CPS). However, CPS with machine learning components is an important safety-critical use case for neural network verification.
For example, a neural network may be utilized as a feedback controller for some plant model, typically represented as ordinary differential equations (ODEs) or generalizations thereof, such as hybrid automata. These are known as neural network control systems (NNCS). The introduction of constraints describing the dynamics of a CPS requires revisiting several blocks of Fig. 1. Specifically, we need to replace the purely symbolic specifications and algorithms with ones allowing for continuous variables and differential equations.
The annual International Competition on Verifying Continuous and Hybrid Systems (ARCH-COMP) has a category for this problem class, known as the AI and NNCS (AINNCS) category [72, 73, 83, 84, 86]. Several approaches for addressing the NNCS verification problem have been developed, such as implemented within software tools like CORA [76], JuliaReach [16], NNV [85, 117], OVERT [109], POLAR [61], Sherlock [40, 41], ReachNN* [44, 62], VenMAS [3], and Verisig [67‐69].
More broadly, researchers have considered several strategies for the specification of properties of CPS with neural network components [7, 24, 48]. These cover significant challenges in the CPS domain, ranging from classical software verification problems to real-time systems concerns and scalability, as well as the search for suitable specifications [89, 106, 115, 116]. Similarly to the standard neural network verification pipeline of Fig. 1, this area would benefit from more principled programming language support.
Formal Specification of Probabilistic Properties. Program synthesis techniques can be valuable allies in producing correct-by-construction software and systems. In particular, the synthesis of logical formulas from a neural network and a dataset (e.g., via Inductive Logic Programming) has received long-standing interest [99]. Also orthogonal to our work is probabilistic programming (as seen in [90]), which aims to provide a language and toolchain to express probabilistic properties of programs. It is clear that neural networks – seen as programs – would benefit from a probabilistic specification language. An early example in this direction is the ProbCompCert project [113].
Formalisation of Machine Learning. So far, research on the formalisation of neural networks and optimisation algorithms has developed in isolation from the mainstream neural network verification pipelines summarised in Fig. 1. However, these two lines of research are bound to meet one day. Relevant work on the formalisation of machine learning includes: verification of neural networks in Isabelle/HOL [22] and Imandra [39]; formalisation of piecewise affine activation functions in Coq [5]; formal guarantees, in Coq, of the degree to which a trained neural network will generalise to new data [10]; convergence proofs, in this case for a single-layered perceptron, in Coq [95]; and verification of neural archetypes in Coq [34]. The two approaches that have come closest to unifying formalisation and verification in the neural network domain are the Vehicle formalisation in Agda [9] and the formalisation of differentiable logics in Coq [2]; the relation of the latter to the verification pipeline of Fig. 1 is discussed in Sec. 3.6.
4 The Future Roadmap
Table 1.
Examples of existing solutions and the PL challenges they (partially) address. For the sake of variety, we include the existing solutions in four distinct categories: high-level neural network verification DSLs (Vehicle, CAISAR); best-performing (according to VNN-COMP) low-level neural network verifiers (\(\alpha \beta \)-CROWN, Marabou); formalisation and synthesis of quantised neural networks with mainstream MILP solvers (QEBVerif, Aster); and the use of general-purpose software verifiers (CBMC, ESBMC) in neural network verification. For the latter, when we mark proof certificate production as \(\checkmark ^*\), we refer to the generic proof production available for those verifiers, as opposed to the production of the Farkas witness for neural network UNSAT problems that is available in Marabou.
The previous section identified five desirable programming language features from which neural network verification could benefit: rigorous semantics, support for handling the embedding gap, support for handling the implementation gap, generation of proof certificates, and rigorous integration of property-driven training into verification pipelines. Currently, no single neural network verification tool or framework possesses all five features (see Table 1). Moreover, some tools considered leaders in the neural network verification market do not satisfy any of them. The title of this paper reflects our belief that the desirable solution – global specifications that formally express the expected properties and yield a formal proof that the given implementation respects the specification – is a challenge in programming language design. In this section, we overview a couple of possible directions that may play a role in future solutions.
4.1 A Unified Dependently Typed Language
We believe the idealised solution to be a single language that is expressive enough to implement the machine learning pipeline and, at the same time, to encode the desired properties of both the neural networks created by the pipeline and the pipeline itself. The following is a non-exhaustive list of the types of properties that should be representable:
1.
theoretical results about the convergence of the training process;
2.
correctness of tensor operations that underlie the training;
3.
rich properties of the input data, e.g., constraining inputs to a certain range;
4.
relation between input data and the weights in the network produced, e.g., robustness;
5.
properties of the floating point numbers being used.
Given the complexity of encoding some of these properties (e.g. numerical stability, robustness), we believe that the expressive power of dependent types is a natural fit. We now briefly argue how such a unified dependently-typed language would allow us to make progress on the challenges outlined in Table 1.
1.
Rigorous semantics. The meta-theory of dependent types is well studied [30], so defining a rigorous semantics for the language should not be a significant challenge. Furthermore, by implementing all the components in a single language, the friction of aligning the semantics of the different components in the system is significantly reduced.
2.
Embedding gap. In a dependently typed language, from one perspective there would be no embedding gap, as any representation changes must be stated explicitly as type conversions. However, from another perspective, working in such a language does not address the fundamental problem of translating the proofs from the problem space to the embedding space. It merely moves the work from the external proofs into the type-conversion functions. Nevertheless, the expressive power of dependent types is more than sufficient to implement the partial solutions proposed by Vehicle (Sec. 3.3).
3.
Implementation gap. The implementation gap (Sec. 3.4) is resolved, as implementations must respect their types. Consider numerical types, for example. If our specification assumes infinite-precision reals, there is no way to instantiate an implementation that uses machine floats, as we will not be able to prove that machine floats are a valid representation of the reals. If our specification is weak and we do not require properties of the operations or other equalities, then machine floats may be a valid data type for the chosen constraints. If we intend the implementation to operate on machine floats, we can describe all the properties of interest (e.g., lack of associativity) in the data type. Note that we cannot use external libraries such as XLA or OpenBLAS without verifying them, as this would break all the formal guarantees in our specification. Either these libraries have to be verified formally, or we can synthesise code with similar runtime properties directly from our specification in a type-preserving way.
4.
Proof certificates. In a dependently-typed language, where the specifications are encoded as types, there is no need for separate proof certificates or proof checkers. In particular, the terms themselves act as the proof certificates and the type-checker acts as the proof checker.
5.
Supports training. Although training is currently carried out in non-dependently-typed languages, there is nothing stopping training (e.g. automatic differentiation or similar algorithms) from being implemented in a dependently-typed language. Not only would such an implementation significantly reduce the friction between training and verification, it would also facilitate integrating property-driven training and verification by viewing it as a code synthesis problem. The key idea here is that generating code from a formal specification is much easier than checking whether given code respects the specification.
4.2 Formal Interfaces
However, we are not naive about the difficulty of implementing such a unified framework. Firstly, it will be an uphill battle to overcome the significant first-mover advantage of existing tools in their respective domains, e.g. training frameworks in Python, C and other languages, and neural network verifiers. Even leaving that aside, prior work has shown that checking such complex type-based specifications efficiently is still a challenging problem (e.g. Kokke et al. [77]).
Therefore, we believe a more realistic short-term goal is to keep the overarching, maximally expressive specification language, but to design a compiler that can utilise existing tools to achieve certain subgoals. In particular, we should follow the approach of industry, where the use of many disparate systems is common. In such an environment, the designers of the individual components should not only rigorously pre-define their interfaces, but also provide full formal specifications of their behaviour and, ideally, provide proof certificates that the output satisfies the specification as part of the interface. This would allow the compiler to specify the expected behaviour of a given module and let the programmer choose the best implementation (provided it respects the specification) at that abstraction level.
One possible inspiration for the design of such interfaces could come from the rich literature on behavioural interface specification languages (BISLs) [57]. A BISL is a language used to specify the expected behaviour of a program at the function level, providing fine-grained control over how precisely each function is described. BISLs usually follow a Hoare-triple-inspired formalism: the programmer specifies contracts on the control flow, and automated provers based on weakest-precondition calculus or SAT solving then discharge the resulting proof obligations. Drawing from well-known languages like SPARK and Eiffel, it would then become easy to specify invariants across several functions at once. Such properties could then be translated back to the original programming language (and checked with its type system) or, if this results in a program that is impossible to represent, checked using external provers. This approach would be representative of what has been done in industry for critical systems for decades, with JML, Why3 [45] or SPARK.
5 Conclusion
We have given some support for our main thesis – that neural network verification is increasingly becoming a programming language challenge. We hope this paper will provoke a stimulating discussion of this topic, helping the programming language community explore the opportunities presented by this new domain. Although we have supported our arguments with references to existing approaches, this is not a survey paper, and we make no claims of bibliographic completeness.
Acknowledgements
M. Daggitt and E. Komendantskaya acknowledge the partial support of the EPSRC grant AISEC: AI Secure and Explainable by Construction (EP/T026960/1). E. Komendantskaya was supported by ARIA: Mathematics for Safe AI grant. L. Cordeiro and E. Manino acknowledge the support of the EPSRC grant EnnCore: End-to-End Conceptual Guarding of Neural Architectures (EP/T026995/1). J. Girard-Satabin and Augustin Lemesle were supported by the French Agence Nationale de la Recherche (ANR) grant ANR-23-DEGR-0001 as part of the France 2030 programme. The work of Isac and Katz was partially funded by the European Union (ERC, VeriDeL, 101112713). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency. Neither the European Union nor the granting authority can be held responsible for them.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
\(^1\) We deliberately use the term “properties” rather than “specifications” here, as the latter implies the presence of a sufficiently general specification language.