Open Access 2025 | Original Paper | Book Chapter

Parallel Equivalence Checking of Stabilizer Quantum Circuits on GPUs

Authors: Muhammad Osama, Dimitrios Thanos, Alfons Laarman

Published in: Tools and Algorithms for the Construction and Analysis of Systems

Publisher: Springer Nature Switzerland


Abstract

This chapter examines the crucial role of equivalence checking (EC) in the development of large-scale quantum computers, particularly in the noisy intermediate-scale quantum (NISQ) era. It focuses on stabilizer circuits, which are classically simulatable and integral to error-corrected quantum computers. The authors propose a groundbreaking approach to improving the scalability and sustainability of EC by harnessing the parallel processing capabilities of Graphics Processing Units (GPUs). This method achieves substantial speedups in EC computations, outperforming the state of the art by a factor of 186 and reducing energy consumption by up to 98%. The chapter delves into the design and implementation of an efficient data structure for the stabilizer tableau formalism on GPUs and describes the parallel algorithm, which sweeps all parallel gates simultaneously while concurrently updating the tableau. It also provides a thorough evaluation of the proposed tool, QuaSARQ, against existing tools such as CCEC and ECMC, demonstrating superior performance in terms of speed, memory usage, and energy efficiency. The experimental results underline the effectiveness of QuaSARQ in achieving unprecedented speedups for smaller circuits of shallow depth, making it a significant contribution to the field of quantum computing technology.
Notes
This work was supported by the Dutch National Growth Fund (NGF), as part of the Quantum Delta NL programme.

1 Introduction

Motivation. We are currently in the noisy intermediate-scale quantum (NISQ) era of quantum computing [25] and progressing towards machines capable of handling quantum circuits with many qubits and gates [2, 4, 29]. To build large-scale quantum computers, efficient compilation, optimization, and verification of quantum circuits are essential. Equivalence Checking (EC) is a critical subroutine in all of these tasks. It is therefore crucial to perform EC as efficiently as possible.
This work focuses on stabilizer circuits, an important class of quantum circuits that are classically simulatable [1, 12]. Stabilizer circuits are integral to the development of error-corrected quantum computers, serving as the foundation for the most prominent and scalable error-correcting codes; a leading example is the surface code [7, 10-12, 15, 17-20, 28]. The stabilizer circuits necessary for fault-tolerant quantum computing are exceptionally large, spanning millions of qubits and gates [7, 10, 17, 19], surpassing the capabilities of current multi-core architectures such as CPUs.
To further enhance the scalability and sustainability of EC, we propose harnessing Graphics Processing Units (GPUs) for parallel processing. By utilizing the abundant cores of GPUs, we achieve substantial accelerations in EC computations and, in the process, improve their energy efficiency [13]. Our novel algorithm efficiently performs EC for stabilizer circuits on GPUs, outperforming the state-of-the-art Clifford-Circuits Equivalence Checking (CCEC) framework (which uses Stim as a backend) [29] by an acceleration factor of 186\(\times \), while reducing energy consumption by up to 98%. Stim [9] is the state of the art for CPU-based simulation of stabilizer quantum circuits. It excels at simulating large circuits by utilizing the vectorized instructions of modern CPUs, enabling simultaneous execution of arithmetic and logical operations at a large scale. While Stim is tailored to CPU-based simulation, the growing complexity of stabilizer circuits demands more powerful hardware solutions, such as GPUs. However, previous attempts to use GPUs for these circuits were unsuccessful. Notably, an earlier attempt by the author of Stim to accelerate the tableau formalism [12] using GPUs did not achieve the desired results.1 With this work, we demonstrate that GPUs can yield significant gains in performing this task.
Related Work. The state-of-the-art tool for EC of stabilizer circuits is CCEC [29], a framework that uses the Stim simulator [9] as a backend. CCEC leverages a theorem that reduces the problem of quantum circuit equivalence to a computationally tractable statement for stabilizer circuits. Specifically, the equivalence of two stabilizer circuits can be evaluated using techniques designed for their efficient simulation. By replacing CCEC's Stim components with an independent GPU-based algorithm, we can develop a more efficient EC tool.
We now discuss previous efforts to parallelize the simulation of universal quantum circuits, generally a more complex task than simulating stabilizer circuits. Accelerating the simulation of universal circuits does not inherently improve performance for special classes such as stabilizer circuits; instead, algorithms specifically tailored to stabilizer circuits are often necessary. For instance, while Stim can efficiently handle stabilizer circuits with millions of qubits, tools designed for universal circuits, such as Qiskit-Aer [16], which utilizes GPUs, scale in practice to hundreds of qubits.
Early attempts at parallelizing the simulation of quantum circuits primarily focused on algebraic operations like matrix multiplication, which are fundamental to quantum simulation [6, 24]. While straightforward, these methods are restricted to circuits with hundreds of qubits or fewer. Others [27] use \(\textit{Z} \textit{X} \)-diagrams and parallelization to rewrite and optimize circuits into reduced versions with fewer T gates.2 The authors of [8] introduce stabilizer frames, a formalism that reduces the simulation of universal circuits to the simulation of stabilizer states. However, the number of frames increases exponentially with the number of T gates.
ECMC [21] is another relevant tool that, although not parallelized, utilizes theorem 1 to perform EC. By leveraging weighted model counting, ECMC can process universal quantum circuits with hundreds of qubits. This is on par with other simulators, such as [26], which uses symbolic binary decision diagrams (CFLOBDDs) to simulate universal circuits.
Contributions. In summary, our contributions are as follows:
\(\star \)
Exploring the design space and analyzing various choices to derive an efficient data structure for the stabilizer tableau formalism [12] on the GPU.
\(\star \)
A parallel algorithm on two levels: first, sweeping all parallel gates simultaneously within a specific time window; second, updating the tableau concurrently on the GPU.
\(\star \)
An implementation for checking equivalence of stabilizer circuits on GPUs, leveraging concurrent kernel execution [23].
\(\star \)
A thorough evaluation of our tool QuaSARQ compared to CCEC [29] and ECMC [21], which uses the same theorem.
Outline. We give formal definitions of the quantum basics and of EC for stabilizer circuits in the next section. In section 3, we analyze various design choices for the tableau data structure and present our algorithms for EC. Section 4 provides a thorough evaluation of our algorithms on the GPU. Finally, in section 5, we conclude and present directions for future work.

2 Preliminaries

2.1 Quantum States and Fundamental Concepts

A qubit is described by a linear combination of the basis states \(\mathinner {|{0}\rangle }\) and \(\mathinner {|{1}\rangle }\) that can be identified with the 2-vectors \(\begin{bmatrix} 1 & 0 \end{bmatrix}^T\) and \(\begin{bmatrix} 0 & 1 \end{bmatrix}^T\), respectively. These states are analogous to the classical bits 0 and 1 but in the quantum case, a single qubit can take on the value of the linear combination \(\mathinner {|{\psi }\rangle }= \alpha _0 \mathinner {|{0}\rangle } + \alpha _1 \mathinner {|{1}\rangle } \) where \(\alpha _0, \alpha _1\) are complex numbers and \(\mathinner {|{\psi }\rangle }\), like any quantum state, must be normalized to be a unit vector (i.e., \(|\alpha _0|^2 + |\alpha _1|^2 = 1\)). The linear combination of basis states is also known as superposition.
The operation for combining \(n \) single-qubit states \(\mathinner {|{\psi }\rangle }_1,\cdots ,\mathinner {|{\psi }\rangle }_n\) into an \(n \)-qubit state \(\mathinner {|{\psi }\rangle }_1\otimes \cdots \otimes \mathinner {|{\psi }\rangle }_n\) is known as the tensor product. It is denoted by \(\otimes \) and is typically known in linear algebra as the Kronecker product. The \(n \)-qubit states have the form \(\sum _{\ell \in \{ 0,1 \}^{n} } \alpha _{\ell } \mathinner {|{\ell }\rangle }\) where \(\mathinner {|{\ell }\rangle }\) is the conventional way of denoting a string of multiple tensored single-qubit states, i.e., we suppress the \(\otimes \) symbol and say \(\mathinner {|{1011}\rangle }\) instead of \(\mathinner {|{1}\rangle }\otimes \mathinner {|{0}\rangle } \otimes \mathinner {|{1}\rangle } \otimes \mathinner {|{1}\rangle }\). If a quantum state cannot be factored into a tensor product of its individual subsystems, then we say it is an entangled state. For example, the state \(\frac{1}{\sqrt{2}}(\mathinner {|{00}\rangle }+\mathinner {|{11}\rangle })\) is entangled, while the state \(\mathinner {|{1001}\rangle }\) is not.
In general, a quantum state \(\mathinner {|{\psi }\rangle }\) of \(n \) qubits can be written as a vector of \(2^n \) complex numbers: \(\mathinner {|{\psi }\rangle }=[\alpha _0,\alpha _1,\cdots ,\alpha _{2^n-1}]^T\). For every state, there is an adjoint defined as \((\mathinner {|{\psi }\rangle }^*)^T\) and denoted by \(\mathinner {\langle {\psi }|}\). The “\(*\)” operator is complex conjugation, meaning that we replace every entry with its complex conjugate and then transpose the vector (i.e., transform it into a row vector). The normalization requirement can now be reformulated as \(\mathinner {\langle {\psi }|}\cdot \mathinner {|{\psi }\rangle }=1\). This product is also known as the inner product and can be performed between distinct states.
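For instance, for a single-qubit state, the inner product of the state with itself recovers exactly the normalization condition stated above:
$$\mathinner {\langle {\psi }|}\cdot \mathinner {|{\psi }\rangle } = \begin{bmatrix} \alpha _0^* & \alpha _1^* \end{bmatrix} \begin{bmatrix} \alpha _0 \\ \alpha _1 \end{bmatrix} = |\alpha _0|^2 + |\alpha _1|^2 = 1.$$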

2.2 Stabilizer Circuits

Quantum states can be manipulated by two types of operations: quantum gates and quantum measurements. Quantum gates are unitary matrices, representing reversible linear transformations denoted as \(U\in \mathbb {C}^{2^n \times 2^n}\). These gates preserve both the norm and the inner product of quantum states. A key property of unitary gates is that their inverse, denoted \(U^{\dagger }\), satisfies \(UU^{\dagger }=U^{\dagger }U=I\), where \(U^{\dagger }=(U^*)^T\) represents the conjugate transpose of U. The effect of gate U acting on a state \(\mathinner {|{\psi }\rangle }\) can be calculated using the matrix-vector multiplication \(U\cdot \mathinner {|{\psi }\rangle }\).
Two important (but non-universal) gate sets are the Pauli gates, defined as the following unitary matrices (shown together with the identity \(\textit{I} \); the matrices were lost in extraction and are restated here in their standard form):
$$\textit{I} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad \textit{X} = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}, \quad \textit{Y} = \begin{bmatrix} 0 & -i \\ i & 0 \end{bmatrix}, \quad \textit{Z} = \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix},$$
and the Clifford gate set (CX is also called CNOT):
$$\textit{H} = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}, \quad \textit{P} = \begin{bmatrix} 1 & 0 \\ 0 & i \end{bmatrix}, \quad \textit{CX} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{bmatrix}.$$
Fig. 1.
A 2-qubit stabilizer circuit.
Stabilizer circuits, also known as Clifford circuits, are quantum circuits composed of Clifford gates. The quantum states generated by applying these circuits to the computational basis \(|0\rangle ^{\otimes n}\) are called stabilizer states. Interestingly, stabilizer circuits can be simulated efficiently by a classical algorithm [1, 12].
Example 1
Consider the circuit shown in fig. 1, which starts with the initial state \(\mathinner {|{00}\rangle }\). The horizontal wires represent the qubits, and the dashed lines indicate the intermediate states. The final state \(\mathinner {|{\psi _3}\rangle }\) is obtained by sequentially applying the gates to the initial state \(\mathinner {|{00}\rangle }\) from left to right.
1.
Apply the Hadamard gate (H) to the first qubit: \(\textit{H} \mathinner {|{0}\rangle } = \frac{1}{\sqrt{2}}(\mathinner {|{0}\rangle } + \mathinner {|{1}\rangle })\). The circuit’s state becomes: \(\mathinner {|{\psi _1}\rangle } = \frac{1}{\sqrt{2}} (\mathinner {|{00}\rangle } + \mathinner {|{10}\rangle }).\)
 
2.
Apply the Controlled-X gate (CX) to both qubits. The effect of CX is to flip the target qubit when the control qubit is \(\mathinner {|{1}\rangle }\). Thus, \(\mathinner {|{\psi _2}\rangle } = \textit{CX} \mathinner {|{\psi _1}\rangle } = \frac{1}{\sqrt{2}} (\mathinner {|{00}\rangle } + \mathinner {|{11}\rangle }).\)
 
3.
Apply the Pauli-Z gate (Z) to the second qubit. Since \(\textit{Z} \mathinner {|{0}\rangle } = \mathinner {|{0}\rangle }\) and \(\textit{Z} \mathinner {|{1}\rangle } = -\mathinner {|{1}\rangle }\), the final state is: \(\mathinner {|{\psi _3}\rangle } = \textit{Z} \mathinner {|{\psi _2}\rangle } = \frac{1}{\sqrt{2}}(\mathinner {|{00}\rangle } - \mathinner {|{11}\rangle })\).
 
Equivalently, we could represent the entire circuit in the previous example as a unitary operator \(U = (\textit{I} \otimes \textit{Z} \,) \cdot \textit{CX} \cdot (\textit{H} \otimes \textit{I} \,)\), and then apply it to \(\mathinner {|{00}\rangle }\) to compute the final state. Generally, every \(n \)-qubit circuit corresponds to a unitary operator in \( \mathbb {C}^{2^n \times 2^n}\).
The Pauli group \(\mathcal {P} \) on \(n \) qubits is defined as the set of operators of the form \(\pm \mathcal {P} _1 \otimes \cdots \otimes \mathcal {P} _n\) with \(\mathcal {P} _i\in \{ \textit{I}, \textit{X}, \textit{Y}, \textit{Z} \,\}\). For simplicity, the tensor product is often suppressed, and \(n \)-Pauli strings are written in shorthand. For example, \(\textit{X} \otimes \textit{I} \otimes \textit{Y} \otimes \textit{X} \otimes \textit{Z} \) can be compactly written as XIYXZ. The Clifford group C on \(n \) qubits consists of the unitary operators V that normalize the Pauli group \(\mathcal {P}\), that is, \(C=\{ V\in \mathbb {C}^{2^n \times 2^n} \ | \ V\mathcal {P} V^{\dagger }=\mathcal {P} \}\). The Clifford group is generated by the Clifford gate set [22]. Substituting V with any Pauli gate satisfies the normalization condition \(V\mathcal {P} V^{\dagger }=\mathcal {P} \). This shows that Pauli gates belong to the Clifford group, and thus they can be written as a combination of gates from the Clifford gate set. Each stabilizer state, generated by a stabilizer circuit, can be uniquely encoded by a subgroup of the Pauli group, requiring exactly \(n \) independent generators, where each generator is an \(n \)-Pauli string.
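As a small worked instance of the normalization condition, conjugating the Pauli gate \(\textit{X} \) by the Clifford gate \(\textit{H} \) (recall \(\textit{H} ^{\dagger } = \textit{H} \)) yields another Pauli gate:
$$\textit{H} \textit{X} \textit{H} ^{\dagger } = \frac{1}{2}\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}\begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix} = \textit{Z}.$$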
Stabilizer states are so called because they are stabilized by certain operators. A unitary operator U stabilizes a state \(\mathinner {|{\psi }\rangle }\) iff \(U \mathinner {|{\psi }\rangle }=\mathinner {|{\psi }\rangle }\). For example, the Pauli string IZ stabilizes the state \(\mathinner {|{0}\rangle }\otimes \mathinner {|{0}\rangle }= \mathinner {|{00}\rangle }\). A state can be stabilized by multiple Pauli strings; for instance, \(\mathinner {|{00}\rangle }\) is also stabilized by II, ZI and ZZ. These Pauli strings generate the maximal commutative subgroup \(\mathcal {H}_{\max }=\{\textit{II}, \textit{IZ}, \textit{ZI}, \textit{ZZ}\,\}\), which is the stabilizer group of \(\mathinner {|{00}\rangle }\). Each maximal commutative subgroup of the Pauli group stabilizes exactly one stabilizer state [22].
Although the stabilizer group for an \(n \)-qubit stabilizer state has \(2^n \) elements, only \(n \) Pauli strings are needed to generate such a group. For example, \(\mathcal {H}_{\max }\) can be generated by the two generators \(\{\textit{IZ}, \textit{ZZ}\,\}\). This property allows stabilizer states to be efficiently and succinctly represented by a data structure such as the tableau formalism.
Tableau formalism. A stabilizer state’s generators can be described by a binary table known as a stabilizer tableau [12]. We refer to the tableau as the tuple \(\mathcal {T} = (\textit{X} _M, \textit{Z} _M, \textit{S} _M)\), where the matrices \(\textit{X} _M\) and \(\textit{Z} _M\) together represent the Pauli strings and \(\textit{S} _M\) encodes their signs (see eq. (1)). Each row encodes a string of \(n \) Pauli operators and has size \(2n +1\), where \(n \) is the number of qubits. If the (i, j)-th entry of \(\textit{X} _M\) is 1 and the corresponding entry in \(\textit{Z} _M\) is 0, then the j-th term of the i-th Pauli string corresponds to an X operator. Analogously, a 0 in \(\textit{X} _M\) and a 1 in \(\textit{Z} _M\) correspond to a \(\textit{Z} \) operator. If both entries are 0, the term is the identity operator \(\textit{I} \); if both entries are 1, it is a \(\textit{Y} \) operator. For example, the tableau below encodes the generators \(\{ \textit{-ZII}, \textit{IZI}, \textit{IXY}\,\}\).
$$\textit{X} _M = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 1 & 1 \end{bmatrix}, \qquad \textit{Z} _M = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}, \qquad \textit{S} _M = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}$$
(1)
To update \(\mathcal {T} \) into a new state, after the action of a Clifford gate on a qubit \(q \), one needs to apply the following updating rules for all \(i \in \{1,\dots ,n \}\) [12]:
$$\begin{aligned} \begin{aligned} & \textit{H} _q: \textit{S} _M(i) := \textit{S} _M(i) \oplus \bigl ( \textit{X} _M(i, q) \wedge \textit{Z} _M(i, q) \bigl ), \\ & \quad \quad \text {Swap}\bigl ( \textit{X} _M(i, q), \textit{Z} _M(i, q) \bigl ); \\ & \textit{P} _q: \textit{S} _M(i) := \textit{S} _M(i) \oplus \bigl ( \textit{X} _M(i, q) \wedge \textit{Z} _M(i, q) \bigl ), \\ & \quad \quad \textit{Z} _M(i, q) := \textit{Z} _M(i, q) \oplus \textit{X} _M(i, q); \\ & \textit{CX} _{q_1,q_2}: \textit{S} _M(i) := \textit{S} _M(i) \oplus \bigl ( \textit{X} _M(i, q_1) \wedge \textit{Z} _M(i, q_2) \wedge \lnot \bigl (\textit{X} _M(i, q_2) \oplus \textit{Z} _M(i, q_1) \bigl ) \bigl ), \\ & \quad \quad \quad \quad \textit{Z} _M(i, q_1) := \textit{Z} _M(i, q_1) \oplus \textit{Z} _M(i, q_2), \\ & \quad \quad \quad \quad \textit{X} _M(i, q_2) := \textit{X} _M(i, q_2) \oplus \textit{X} _M(i, q_1); \end{aligned} \end{aligned}$$
(2)
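For concreteness, the following minimal C++ sketch (our own naming; a bit-level reading of eq. (2), not the tool's packed GPU data structure) applies these rules one generator at a time:

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Bit-level tableau: one row per generator, entries in {0,1}.
struct Tableau {
    int n;                                   // number of generators (= qubits)
    std::vector<std::vector<uint8_t>> X, Z;  // X_M and Z_M
    std::vector<uint8_t> S;                  // S_M sign bits
};

void applyH(Tableau& t, int q) {             // H_q rule of eq. (2)
    for (int i = 0; i < t.n; ++i) {
        t.S[i] ^= t.X[i][q] & t.Z[i][q];     // S(i) ^= X(i,q) & Z(i,q)
        std::swap(t.X[i][q], t.Z[i][q]);     // swap X(i,q) and Z(i,q)
    }
}

void applyP(Tableau& t, int q) {             // P_q rule of eq. (2)
    for (int i = 0; i < t.n; ++i) {
        t.S[i] ^= t.X[i][q] & t.Z[i][q];
        t.Z[i][q] ^= t.X[i][q];              // Z(i,q) ^= X(i,q)
    }
}

void applyCX(Tableau& t, int q1, int q2) {   // CX_{q1,q2} rule of eq. (2)
    for (int i = 0; i < t.n; ++i) {
        t.S[i] ^= t.X[i][q1] & t.Z[i][q2] & (1u ^ (t.X[i][q2] ^ t.Z[i][q1]));
        t.Z[i][q1] ^= t.Z[i][q2];
        t.X[i][q2] ^= t.X[i][q1];
    }
}
```

The GPU data structure of section 3 replaces these inner loops with 64-bit word operations and thread-level parallelism.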
Observe that collapsing the signs involves a sequential XOR reduction (\(\oplus \)), which poses a challenge for parallelization. Algorithm 2 addresses this issue, though at the cost of added algorithmic complexity. The following example shows how gates can be applied to \(\mathcal {T} \) in parallel when they act on independent qubits and hence affect independent columns.
Example 2
Given a 3-qubit system initialized to the \(\mathinner {|{000}\rangle }\) state, assume that the gates \(\textit{H} _1\) and \(\textit{P} _3\) are applied, in order, to qubits 1 and 3, respectively. The affected columns are updated according to eq. (2), as shown below:
$$\Bigl(\,\textit{X} _M = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix},\ \textit{Z} _M = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}\Bigr) \;\xrightarrow{\ \textit{H} _1,\ \textit{P} _3\ }\; \Bigl(\,\textit{X} _M = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix},\ \textit{Z} _M = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}\Bigr)$$
Here \(\textit{H} _1\) swaps column 1 of \(\textit{X} _M\) and \(\textit{Z} _M\), while \(\textit{P} _3\) XORs column 3 of \(\textit{X} _M\) into \(\textit{Z} _M\), which changes nothing in this case because that column is all zeros; the signs \(\textit{S} _M\) remain 0.
(3)

2.3 Equivalence Checking

As each quantum circuit corresponds to a unitary operator, a naive approach to EC of two circuits U, V would be to compare the exact values of the operators, which are given in decomposed form as sequences of gates. Specifically, one would compose the gates by matrix multiplication to obtain the two operators and check whether \(U=cV\), where \(c\in \mathbb {C}\) has norm 1. The constant c is called the global phase and has no physical significance, as it is unobservable [22]. While this approach would work in principle, the matrices grow exponentially with the number of qubits.
Fortunately, there exists a more efficient approach, particularly suited to stabilizer circuits. It exploits a mathematical result (theorem 1 below) that enables equivalence evaluation by computing in the stabilizer basis. As we will see, the tableau formalism rigorously facilitates this task. To evaluate whether two stabilizer circuits U, V are equivalent, it suffices to check that their simulations result in exactly the same tableau when initialized with the \(\textit{I} _\textit{Z} \) tableau.3 Additionally, their tableaux should be identical when initialized with the \(\textit{I} _\textit{X} \) tableau.4 It is important to emphasize that this is not the same as comparing the output states of two simulated circuits: the generators of the group encoding a given state are not unique, so different tableaux may encode the same state. In our setting, however, we compare not the encoded states but the tableaux themselves, so the output generators, and thus the tableaux, must be identical. This process efficiently decides equivalence up to a global phase factor. See theorem 1 below and its proof in [29].
Theorem 1
Let U,V be unitaries, each of them corresponding to an \(n \)-qubit circuit. Then \(U\cong V\) if and only if for all \(\mathcal {P} \in \{ \textit{X}, \textit{Z} \,\}\) and \(i\in [n ]\), the condition \(U\mathcal {P} _{i}U^{\dagger }=V\mathcal {P} _i V^{\dagger }\) holds, where \(\mathcal {P} _i=\textit{I} ^{\,\otimes i-1} \otimes \mathcal {P} \otimes \textit{I} ^{\,\otimes n-i}\).
When referring to the equivalence of two circuits \(\mathcal {C},\mathcal {C} '\) that are considered not as single unitary operators but as circuits (i.e., sequences of gates), we will, by abuse of notation, write \(\mathcal {C} \cong \mathcal {C} '\) when \(\mathcal {C} \) and \(\mathcal {C} '\) encode decompositions of equivalent unitary operators.
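To make the check concrete, here is a hedged serial sketch of the procedure implied by theorem 1. Circuit, Gate, and the tableau helpers below are hypothetical stand-ins, not the interface of any existing tool; the parallel version is the subject of section 3:

```cpp
#include <vector>

// Hypothetical types: a gate list and a tableau supporting eq. (2) updates.
struct Gate { /* gate symbol and qubit operands */ };
using Circuit = std::vector<Gate>;

enum class Init { I_Z, I_X };  // diagonal Z_M resp. X_M, zeros elsewhere

struct Tableau {
    static Tableau initial(int n, Init init); // build the I_Z or I_X tableau
    void apply(const Gate& g);                // update rules of eq. (2)
    bool operator==(const Tableau& o) const;  // word-for-word comparison
};

// C ≅ C' iff simulating both from I_Z and from I_X yields identical tableaux.
bool equivalent(const Circuit& c1, const Circuit& c2, int n) {
    for (Init init : {Init::I_Z, Init::I_X}) {
        Tableau t1 = Tableau::initial(n, init);
        Tableau t2 = Tableau::initial(n, init);
        for (const Gate& g : c1) t1.apply(g);
        for (const Gate& g : c2) t2.apply(g);
        if (!(t1 == t2)) return false;  // tableaux, not states, must match
    }
    return true;
}
```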

2.4 GPU Programming

CUDA [23] is a programming interface that enables general-purpose programming for a GPU. It has been developed and continues to be maintained by NVIDIA since 2007. In this work, we use CUDA with C++. Therefore, we use CUDA terminology when we refer to thread and memory hierarchies.
A GPU consists of a finite number of streaming multiprocessors (SMs), each containing hundreds of cores. For instance, the RTX 4090, which we used for this work, has 128 SMs with a total of 16,384 cores. A GPU computation is launched in a program by the host (the CPU side of a program), which calls a GPU function, called a kernel, executed by the device (the GPU side of a program).
When a kernel is called, the number of threads needed to execute it is specified. According to the CUDA paradigm [23], these threads are partitioned into thread blocks, i.e., 3-dimensional (3D) vectors grouping up to 1,024 threads. Each thread block is assigned to an SM. All threads together form a 3D grid, where threads and blocks can be indexed by a 1D, 2D, or 3D unique identifier (ID) accessible within the kernel. This ID works similarly to (\(x\), \(y\), \(z\)) coordinates in 3D space. With this ID, different threads in the same block can process multi-dimensional data (e.g., \(\textit{X} _M\) and \(\textit{Z} _M\)).
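As a minimal CUDA sketch of this indexing scheme (our own illustration, not QuaSARQ's kernel), a 2D grid can dedicate the x-dimension to gates and the y-dimension to words of a tableau column:

```cuda
// Illustrative kernel skeleton: the x-dimension indexes gates in a window,
// the y-dimension indexes 64-bit words within a tableau column.
__global__ void indexDemo(int numGates, int wordsPerColumn) {
    int tid_x = blockIdx.x * blockDim.x + threadIdx.x; // which gate
    int tid_y = blockIdx.y * blockDim.y + threadIdx.y; // which word of a column
    if (tid_x >= numGates || tid_y >= wordsPerColumn) return;
    // per-(gate, word) work on X_M and Z_M would go here
}
```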
Memory Hierarchy. A GPU has multiple types of memory, with the largest being the global memory. This memory can be used to transfer data between the host and the device. It is accessible by all GPU threads and offers high bandwidth but also high latency. Other types are shared memory and registers. The former is on-chip memory with a low latency, used as block-local memory. Threads within a block can share data via this memory. Registers are the fastest and are used to store thread-local data. Yet, their size is very small. Allocating too much memory for thread-local variables can cause data to spill over into global memory, dramatically reducing performance.
GPU threads can use atomic operations to manipulate data atomically in both global and shared memory. For example, atomicXOR(addr, val) atomically XORs the element stored at addr with the 32-bit or 64-bit value val.

3 Parallel Equivalence Checking on GPUs

In this section, we begin by evaluating the design decisions for storing and updating the data structure that we later use for efficient EC. The data structure is based on the stabilizer tableau but optimized for GPU usage. Next, we introduce the implementation of EC on the GPU, taking advantage of our optimized data structure. Our proposed implementation incorporates two levels of parallelism: first, we process different generators (Pauli strings) independently in parallel; second, we maximize parallelism by processing as many parallel gates as possible simultaneously. However, calculating the signs of the Pauli strings in parallel remains a challenge. We present a tree-based approach for collapsing signs on-the-fly.

3.1 Data Structure Design

Fig. 2.
Logical and physical formats of a word-aligned tableau.
In designing the data structure for storing and updating the stabilizer tableau on a GPU, several key considerations were made to ensure high efficiency of both GPU memory usage and execution performance. GPUs operate on aligned memory, meaning data are stored and accessed in words (e.g., 8, 32, or 64 bits). This allows multiple bits to be processed simultaneously with a single instruction (e.g., XOR), which is crucial for fast tableau updates. To represent the stabilizer tableau, we encode the binary form of the tableau as described in section 2.2, eq. (1). Figure 2 visualizes the logical format of tableau \(\mathcal {T} \). Words in \(\textit{X} _M\), \(\textit{Z} _M\), and \(\textit{S} _M\) are indexed by \(i \in [k ]\) and \(j \in [n ]\), where \(k\) is the number of generators divided by the word length in bits. The tableau size is thus \(k \cdot (2n + 1)\) words.
Memory Alignment. The tableau is stored in 64-bit words, the maximum size supported by the GPU’s architecture, enabling the processing of 64 bits in parallel with a single instruction. This reduces the overall number of operations and boosts performance, as demonstrated in our experiments in fig. 3a. The experiments were conducted using benchmarks introduced in section 4. Notably, the 64-bit word size outperforms its counterparts significantly. The main reason for the performance peak lies in the fact that a 64-bit word allows for the execution of a single instruction on 64 different bits simultaneously. This minimizes the computational steps, leading to faster tableau updates.
Memory Layout. Given the nature of the tableau updates when applying Clifford gates (cf. example 2 in section 2.2), it is logical to store the \(\textit{X} _M\) and \(\textit{Z} _M\) in a column-major format (See fig. 2 in the bottom). This ensures each column can be accessed contiguously in memory, improving locality during tableau updates.
In addition, we experimented with interleaving the \(\textit{X} _M\) and \(\textit{Z} _M\) matrices at both the word and the column level. The rationale behind interleaving is to improve memory locality by storing data that will be accessed together (i.e., from the same qubit) close to each other in memory. However, our results showed that interleaving did not provide a performance benefit, despite the fact that gate updates require communication between the \(\textit{X} _M\) and \(\textit{Z} _M\) columns. We hypothesize that the increased complexity in memory indexing outweighs any potential benefits from interleaving. As shown in fig. 3b and fig. 3c, separate storage of the \(\textit{X} _M\) and \(\textit{Z} _M\) matrices yielded better performance. This finding is consistent with previous results from the Stim simulator [9], where word interleaving was ruled out, albeit without a performance analysis.
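To make the packing and layout concrete, the following host-side C++ sketch (our own naming, under the assumptions of this section) shows how a single H gate touches column q of X_M and Z_M, updating 64 generators per word operation; the GPU kernel of section 3.3 distributes the loop below over threads:

```cpp
#include <cstdint>
#include <utility>

// Column-major, word-aligned tableau: column q of X_M starts at X + q*k,
// where k = (#generators) / 64 is the number of 64-bit words per column.
struct WordTableau {
    uint64_t *X, *Z, *S;  // X_M, Z_M, S_M as arrays of 64-bit words
    int k;                // words per column
};

// H on qubit q: one AND/XOR/swap per word updates 64 generators at once.
void applyH(WordTableau& t, int q) {
    uint64_t* x = t.X + static_cast<size_t>(q) * t.k;
    uint64_t* z = t.Z + static_cast<size_t>(q) * t.k;
    for (int i = 0; i < t.k; ++i) {
        t.S[i] ^= x[i] & z[i];  // sign rule of eq. (2), 64 rows at a time
        std::swap(x[i], z[i]);
    }
}
```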
Fig. 3.
Performance impact of different architectural choices.
Tableau Slicing. For circuits with hundreds of thousands of qubits and more, storing the entire tableau in memory can become infeasible. To tackle this, we implemented a tableau slicing strategy. The tableau is sliced horizontally into smaller, non-square blocks (not to be confused with thread blocks in CUDA), each of which can be processed independently (recall that generators are independent as demonstrated in example 2). This allows us to check the equivalence of circuits with large qubit counts using only the available GPU memory.
Each block is initialized to the same state as the full tableau. For example, in a circuit with 1 million qubits, the tableau would require approximately 250GB of memory. By slicing the tableau into smaller sections, we can reduce this memory requirement to fit within the available GPU memory (24GB in our experiments), as shown in table 1.
Example 3
For a system with 256 qubits, if we store the tableau in 64-bit words, the full tableau would require \(k \cdot (2n + 1) = (256 / 64) \cdot 513 = 2,052\) words in memory. By slicing the tableau’s generators into two halves, each block will require 1,026 words, and the update operations can be performed independently on each block. This strategy ensures that memory is fully utilized while maintaining computational consistency across blocks.
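A back-of-the-envelope sketch in the same spirit (our own arithmetic; the 250GB and 24GB figures are taken from above) estimates how many blocks the slicer needs:

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>

int main() {
    const std::uint64_t n = 1000000;             // qubits
    const std::uint64_t bits = n * (2 * n + 1);  // one tableau: n generators x (2n+1) bits
    const double gigabytes = bits / 8.0 / 1e9;   // ~250 GB for n = 1M
    const double budget = 24.0;                  // available GPU memory (GB)
    // two tableaux are needed for EC, one per circuit
    const int blocks = (int)std::ceil(2.0 * gigabytes / budget);
    std::printf("one tableau: %.0f GB, blocks needed: %d\n", gigabytes, blocks);
    return 0;
}
```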

3.2 The Main Algorithm for Parallel Equivalence Checking

Given two circuits, denoted \(\mathcal {C} \) and \(\mathcal {C} '\), our primary objective is to verify whether \(\mathcal {C} \cong \mathcal {C} '\) by deploying the GPU's massive resources within the tableau formalism. To achieve this, we must define gate parallelism and determine how to schedule gate execution. Informally, parallelizable gates are those that can be executed concurrently because their corresponding operations commute and are independent (i.e., not causally related). We formalize this in terms of causal dependency as follows.
Fig. 4.
Input circuit on the left. Scheduled circuit on the right.
Definition 1
(Gate Dependency Relation). Given a circuit \(\mathcal {C}\), the causal dependency between gates defines a partial order “\(\preceq \)” in the set of gates \(g \in \mathcal {C} \), such that \( g \preceq g ' \) iff \(g '\) is causally dependent on \(g \).
The transitivity of the relation guarantees that the following definition of parallel gates, which admits no dependencies between gates, is well defined.
Definition 2
(Parallel gates). Given a circuit \(\mathcal {C}\), a set \(W \subseteq \mathcal {C} \), is a set of parallel gates (or window) iff \(\forall g, g ' \in W, \ g \npreceq g ' \ \wedge \ g ' \npreceq g.\)
Definition 3
(Maximal Window). A window \( W \subseteq \mathcal {C} \) is maximal iff there is no other window \( W' \subseteq \mathcal {C} \) such that \( W \subsetneq W' \).
Henceforward, by a window, we mean a maximal window and denote it by \(\mathcal {W} \).
The above definitions form the basis for the scheduleWindows routine of Algorithm 1, which outlines the main steps for EC on the GPU. Details of this routine will be discussed later. Given input circuits \(\mathcal {C} \) and \(\mathcal {C} '\), the algorithm checks their equivalence and returns the outcome in the Boolean variable isEquivalent. The CPU schedules the circuits \(\mathcal {R}, \mathcal {R} '\) and transfers them to the GPU memory, while the GPU handles the updates of the tableaux \(\mathcal {T} , \mathcal {T} '\) and checks for equivalence. Typically, the input circuit is stored as a sequence of symbols, one for each gate, along with the associated qubits on which they act. Standard examples are OpenQASM [5] or the Stim [9] format. Figure 5 illustrates the workload distribution between the CPU and GPU for those tasks.
The scheduleWindows procedure (l.1) maximizes parallelism by organizing the gates into maximally independent windows (cf. definition 3). This scheduling ensures efficient utilization of GPU resources during tableau updates. When a maximal window is constructed, we schedule additional windows until all the gates in \(\mathcal {C} \) are assigned to \(\mathcal {R} \). The procedure works as follows: starting with the input circuit (left part of fig. 4) stored in a queue, we fill an empty window with parallel gates, removing them from the queue, until either the window reaches its maximal size \(|\mathcal {W} | = n \) or no more parallel gates can be added due to dependencies (definition 1). This process is repeated, scheduling new windows until the queue is empty (see the right part of fig. 4 for the end result).
The number of windows in \(\mathcal {R}\) is proportional to the circuit depth (i.e., maximum number of gates per wire). In circuits with minimal gate dependencies, more gates can be scheduled into a single window, reducing the effective depth and improving performance by allowing more parallel updates to the tableau.
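A plausible host-side reconstruction of scheduleWindows from the description above (our own code, not the tool's) greedily fills each window, blocking the qubits of deferred gates so that the causal order of definition 1 is preserved:

```cpp
#include <deque>
#include <vector>

struct Gate { std::vector<int> qubits; /* gate symbol omitted */ };
using Window = std::vector<Gate>;

std::vector<Window> scheduleWindows(std::deque<Gate> queue, int n) {
    std::vector<Window> schedule;
    while (!queue.empty()) {
        Window window;
        std::vector<char> busy(n, 0);  // qubits already claimed in this window
        std::deque<Gate> deferred;
        while (!queue.empty() && (int)window.size() < n) {
            Gate g = std::move(queue.front());
            queue.pop_front();
            bool independent = true;
            for (int q : g.qubits) independent = independent && !busy[q];
            // block these wires either way: gates behind g on the same
            // qubit must not jump ahead of a deferred g
            for (int q : g.qubits) busy[q] = 1;
            if (independent) window.push_back(std::move(g));
            else             deferred.push_back(std::move(g));
        }
        // deferred gates precede the untouched rest of the queue
        for (auto it = deferred.rbegin(); it != deferred.rend(); ++it)
            queue.push_front(std::move(*it));
        schedule.push_back(std::move(window));
    }
    return schedule;
}
```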
Algorithm 1
Parallel Equivalence Checking.
Fig. 5.
Workload of algorithm 1.
Next, at l.2, \(\mathcal {R}, \mathcal {R} '\) are copied asynchronously to the device. The subscript d in \(\mathcal {R} _d\) denotes the device version of \(\mathcal {R}\). Asynchronous data transfers can be performed concurrently w.r.t. the host. For example, the tableau allocation and slicing can proceed at l.3 while the circuit transfer is pending. The allocateTableaux function blocks execution until all prior data transfers synchronize. It takes \(2 n (2n + 1)\) bits to allocate enough space for two tableaux (one per circuit). Tableaux are physically allocated in arrays of 64-bit words, as explained in section 3.1, and words are accessed in column-major order. If the tableaux fit entirely into device memory, the values of numTableauBlocks and \(k\) will be 1 and \(n/ 64\), respectively. Otherwise, \(\texttt {numTableauBlocks} \) is some value greater than 1 such that \(0 < k \le (n/ 64)\). Both tableaux have identical dimensions, as the number of qubits is the same for both circuits. For simplicity, we use \(\mathcal {T} \) to refer either to the entire tableau or to any of its individual blocks, depending on the context. As illustrated in fig. 5, the CPU thread's workload is small compared to the GPU tasks and does not add any overhead to the overall runtime.
The loop at l.4 checks \(\mathcal {R} _d\) and \(\mathcal {R} '_d\) for equivalence for every block \(b \) via the routine checkEquivalence. At l.5, we first check under the initial value of the tableau \(\textit{I} _\textit{Z} \) (cf. section 2.3 for more details). If the circuits are equivalent (l.6), we try again with the tableau \(\textit{I} _\textit{X} \) (l.7).
checkEquivalence is defined at lines 9-16; highlighted operations are processed by the GPU. At lines 10-11, tableaux are set to their initial value \(\texttt {initState} \) concurrently via the streams \(\texttt {stream1}\) and \(\texttt {stream2}\). If there is more than one block, i.e., \(b > 1\), the next block must be offset by \((b- 1) \cdot k \) words. Similarly, the loop at l.12 launches two concurrent updateTableauAsync kernels via the streams \(\texttt {stream1}\) and \(\texttt {stream2}\) to update the tableaux for each window in both circuits. A stream is a sequence of instructions that are executed in issue-order on the GPU [23]. Using streams allows concurrent execution of kernels, provided there are enough unoccupied SMs. See fig. 5 for the distribution of our kernels over \(\texttt {stream1}\) and \(\texttt {stream2}\). Observe that these kernels may not completely overlap, depending on the availability of SMs. Our experiments showed that overlapping kernels yield a 25% gain in kernel performance compared to non-overlapping execution. Once all windows are processed, both streams are synchronized at l.15 to ensure the stabilizers in both tableaux are fully built. At l.16, we check the words in \(\mathcal {T} \) against \(\mathcal {T} '\) and return the outcome of the comparison. Next, we explain how updateTableauAsync is implemented.
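The stream usage at lines 10-15 can be sketched as follows (a hedged illustration: the Window type and the updateTableauAsync signature are stand-ins following algorithm 2's notation, and the launch shapes are illustrative):

```cuda
#include <cstdint>

// Hypothetical stand-ins: a scheduled window of parallel gates and the
// tableau-update kernel of algorithm 2.
struct Window { int numGates; const int* gates; };
__global__ void updateTableauAsync(uint64_t* T, Window w) { /* algorithm 2 */ }

void runWindows(uint64_t* T, uint64_t* Tprime,
                const Window* Rd, const Window* RdPrime, int numWindows,
                dim3 grid, dim3 block, size_t shmemBytes) {
    cudaStream_t stream1, stream2;
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);
    for (int w = 0; w < numWindows; ++w) {
        // independent kernels: may overlap on SMs left idle by the other stream
        updateTableauAsync<<<grid, block, shmemBytes, stream1>>>(T, Rd[w]);
        updateTableauAsync<<<grid, block, shmemBytes, stream2>>>(Tprime, RdPrime[w]);
    }
    cudaStreamSynchronize(stream1);  // both tableaux fully built (l.15)
    cudaStreamSynchronize(stream2);
    cudaStreamDestroy(stream1);
    cudaStreamDestroy(stream2);
    // word-for-word comparison of T and Tprime follows (l.16)
}
```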

3.3 Tableau Manipulation on GPU

The kernel in algorithm 2 updates the tableau \(\mathcal {T} \) by applying parallel gates stored in \(\mathcal {W} _d\). Within this process, threads in the \(y\)-dimension (l.2) are responsible for fetching new words per column, whereas threads in the \(x\)-dimension (l.3) handle the retrieval of parallel gates. The variables \(\texttt {tid} _{y} \) and \(\texttt {tid} _{x} \) hold the global thread ID in the \(y\)- and \(x\)-dimension, respectively. To branch over all gates supported by our algorithm, a switch statement is used at l.4. Here, we show only the H gate, but many more Clifford gates are directly supported without decomposition. At lines 5–9, we implement the update rules executed for each gate, modifying both the words encoding the Pauli strings and their signs. Recall that \(\mathcal {T} \) is stored in a column-major array; thus, given a qubit \(q \) and the column length \(k\), a column in the \(\textit{X} _M\) or \(\textit{Z} _M\) matrix can be accessed at position \(q \cdot k \). To access words per column in parallel, the thread ID (\(\texttt {tid} _{y} \)) is used to offset the column index.
The function \({\textsc {getQubits}} \) at l.6 maps the input gate to its connected wires or qubits. Qubit indices are assumed to range from 0 to \(n- 1\) to align with array indexing. At l.7, the signs are calculated and temporarily stored in shared memory (denoted sh). The variable \(\texttt {sh} [\texttt {t}_{x} ]\) holds the thread-local sign per thread block in the \(x\)-dimension. To ensure each thread block has written its signs to sh before proceeding, we synchronize at l.10. Once synchronized, these signs need to be collapsed into a final result, which is used to update the global sign in the tableau. This collapsing process is handled by a specialized procedure, collapseSigns, which we describe in more detail next. The main update to the tableau occurs at l.8, where the words in \(\textit{X} _M\) and \(\textit{Z} _M\) (representing the Pauli strings) are swapped in global memory.
Algorithm 2
Tableau Manipulation on GPU.
Signs Collapsing. In our initial attempt to parallelize the signs collapsing process, we employed two kernels. The first kernel handled the tableau updates and the local sign calculations, storing the results in global memory. The second kernel read these local signs from global memory and performed the collapsing using the atomic operation atomicXOR. Unfortunately, this trial proved to be considerably slow due to the excessive global memory accesses (approximately \(k \cdot n \)) and the high usage of atomic instructions (\(k \cdot |\mathcal {W} _d|\)).
To remedy the global memory access issue, we made a second attempt. In this trial, we eliminated the temporary storage of local signs in global memory. Instead, we collapsed all local signs atomically on-the-fly during the tableau update within the whole window. While this approach saved roughly (\(k \cdot n \)) memory accesses compared to the initial attempt, the number of atomic operations remained a bottleneck. This is where our tree-like approach comes into play. The collapseSigns routine in algorithm 2 effectively reduces the number of atomics by collapsing signs per thread block using a bottom-up binary tree. The downside is the excessive usage of global memory during the tree build-up, which we mitigate by utilizing the fast on-chip shared memory.
Fig. 6.
Running example of algorithm 2 on 128-qubit system.
The loop at l.13 performs the collapsing using the XOR operator in shared memory by sweeping the binary tree in a bottom-up manner. Initially, \(l \) is set to \(B_x/ 2\), where \(B_x = 2^w\) and \(w \in \{1, 2, \ldots , 10\}\). At l.14, we XOR in parallel the first half \(\texttt {sh} [\texttt {t}_{x} ]\) with the second half \(\texttt {sh} [\texttt {t}_{x} + l ]\). Afterwards, threads are synchronized and \(l \) is halved until only one element remains at \(\texttt {sh} [0]\) (resembling the root of the tree), which holds the collapsed block-local sign. Finally, at l.17, we collapse the block-local signs stored at \(\texttt {sh} [0]\) atomically with the old global signs at \(\textit{S} [\texttt {tid} _{y} ]\). This approach introduces an additional \(\log (B_x)\) steps to jump between tree levels (l.13). However, the number of atomic operations is reduced from \(k \cdot |\mathcal {W} _d|\) to \(k \cdot |\mathcal {W} _d| / B_x \).
Figure 6 depicts a running example of algorithm 2 executed on the window \(\mathcal {W} = \{\textit{P} _{120}, \textit{H} _{5}, \textit{H} _{10}, \textit{P} _{100}\}\) with (\(n = 128\)). For that, we require a tableau of dimensions (\(k \cdot (2n + 1) = 2 \cdot (256 + 1)\)). As a result, the number of threads in the \(y\)-dimension is 2, and 4 in the \(x\)-dimension. In that case, one thread block for both dimensions is sufficient (e.g. \(B_x\) = 4, \(B_y\) = 2). Once the collapsing process is complete and threads are synchronized per block, the root of the tree, denoted as \(\texttt {sh} [0]\), holds the final collapsed word. If multiple thread blocks are involved in the \(x\)-dimension, their results are combined using an atomicXOR operation (as shown in l.17). In this specific example, as we only have one thread block in the \(x\)-dimension, \(\texttt {sh} [0]\) can be directly written to \(\textit{S} [\texttt {tid} _{y} ]\) non-atomically.
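A self-contained sketch of the tree-based collapse follows (our reconstruction; in QuaSARQ the local signs come from the tableau update itself rather than a separate input array):

```cuda
// Each x-block XOR-reduces its threads' local sign words in shared memory
// (bottom-up binary tree), then folds sh[0] into the global sign word for
// its y-row with a single atomic. blockDim.x must be a power of two.
__global__ void collapseSignsDemo(unsigned long long* S,
                                  const unsigned long long* localSigns,
                                  int numLocal) {
    extern __shared__ unsigned long long sh[];
    const int tx    = threadIdx.x;
    const int tid_x = blockIdx.x * blockDim.x + tx;
    const int tid_y = blockIdx.y;
    // 0 is the identity of XOR, so out-of-range threads are harmless
    sh[tx] = (tid_x < numLocal) ? localSigns[tid_y * numLocal + tid_x] : 0ULL;
    __syncthreads();
    for (int l = blockDim.x / 2; l > 0; l >>= 1) {  // sweep tree levels
        if (tx < l) sh[tx] ^= sh[tx + l];
        __syncthreads();
    }
    if (tx == 0) atomicXor(&S[tid_y], sh[0]);       // one atomic per block
}
```

The kernel is launched with \(\texttt {blockDim.x} \cdot 8\) bytes of dynamic shared memory, matching \(B_x = 2^w\) above.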

3.4 Complexity Analysis

The overall complexity of algorithm 2 is influenced by two factors: the number of words per column \(k\) and the number of gates \(|\mathcal {W} _d|\). Given that the upper bound of both factors is the number of qubits \(n\), the worst-case parallel complexity can be expressed as \(\mathcal {O}(\frac{n}{N _{y}}\frac{n}{N _{x}} + \log (B_x))\). Here, \(\log (B_x)\) indicates the number of sequential steps at l.13, while \(N _{y}\) and \(N _{x}\) represent the total threads launched in the \(y\)- and \(x\)-dimensions, respectively. Note that the upper bound of the logarithmic steps is limited to \(B_x\), not \(n\), since the maximum size of a thread block is 1,024 threads across all dimensions [23].
The circuit depth (denoted by d) also impacts the EC performance overall. As the depth increases and more gates become causally dependent, fewer gates can be executed in parallel, leading to more sequential tableau updates (cf. the loop at l.12 in algorithm 1). Thus, the total time complexity considering the circuit depth becomes \(\mathcal {O}(d \cdot \frac{n}{N _{y}}\frac{n}{N _{x}} + \log (B_x))\). For shallow circuits with highly parallelizable gates, the runtime is significantly reduced, while deeper circuits with more dependencies between gates lead to a closer-to-linear scaling in depth.

4 Experimental Evaluation

Setup. We implemented algorithm 1 in a new tool called QuaSARQ using CUDA C++. The code was compiled with CUDA 12.4, targeting compute capability 8.9. The GPU experiments were conducted on a machine running Ubuntu 22.04, equipped with an RTX 4090 GPU featuring 16,384 cores at 2.23 GHz and 24GB of global memory. We compare QuaSARQ with CCEC,5 which is based on Stim, the state-of-the-art simulator for stabilizer circuits. Additionally, we evaluate our tool against ECMC,6 which employs the GPMC [14] model counter7 and supports equivalence checking for both Clifford and universal circuits using theorem 1. CCEC and ECMC experiments were conducted separately on the compute nodes of the DAS-6 [3] cluster to dedicate our computing hours on the GPU machine to QuaSARQ experiments. Each node of DAS-6 has an AMD EPYC 7282 CPU (2.8 GHz) with 256GB of memory. Circuits were tested in isolation on a separate computing node, with a timeout of 5,000 seconds. It is worth mentioning that DAS-6's CPU clock is 20% faster than the GPU clock running QuaSARQ.
Since our EC algorithm is deterministic, runtimes can be extrapolated based solely on gate counts. Therefore, evaluating the algorithm on a diverse set of quantum algorithms is unnecessary; testing with randomly generated circuits suffices. To evaluate our algorithm, we generated 500 random circuits using Qiskit8 in OpenQASM format [5]. The number of qubits in these circuits ranges from 1K to 500K, increasing in steps of 1K, with a fixed depth of 100. We chose this value to test the effectiveness of our algorithms in circuits of moderate depth. The total number of gates ranges from 68K to 34M. By default, Qiskit generates uniformly distributed gates in the Clifford group {X, Y, Z, H, P, \(\textit{P} ^\dagger \), CX, CY, CZ, SWAP, ISWAP}, which are supported by our tool and Stim. These gates are common Clifford gates and they can be decomposed into a combination of the basic gates {H, P, CX }. To simulate erroneous circuits for EC, we randomly selected a gate in each circuit and replaced it with another randomly chosen gate in the Clifford group. Thus, there is a \(\frac{1}{11}\) chance of the gate remaining unchanged.
To complement these benchmarks and evaluate ECMC’s performance against QuaSARQ under reasonable computational resources, we generated a smaller set of 91 circuits. In this reduced dataset, the number of qubits was scaled down to 10K, and the circuit depth was limited to a single level, ensuring the tests could be performed within practical time and memory limits.
Fig. 7.
EC Performance of QuaSARQ against CCEC across 500 circuits (qubits count: 1K-500K, circuit depth: 100)
Fig. 8.
EC Performance of QuaSARQ against ECMC across 91 circuits (qubits count: 1K-10K, circuit depth: 1)
Experiments. Figures 7a and 7b show cactus plots of the runtime and the energy consumption of CCEC vs. QuaSARQ, respectively. Similarly, figures 8a and 8b compare QuaSARQ to ECMC. QuaSARQ ’s time includes the initial time (memory allocations, parsing, etc.), the scheduling time to obtain \(\mathcal {R}\) and \(\mathcal {R} '\), the transfer time to send \(\mathcal {R}\) to the device, and the checking time of algorithm 1. To measure the energy draw, we multiplied the runtime by the average power consumed by the processing unit over the life cycle of both applications. Overall, QuaSARQ achieves an average speedup of 81\(\times \) and saves 98% of energy compared to CCEC. Against ECMC, QuaSARQ delivers an acceleration of six orders of magnitude for EC of stabilizer circuits. This is not surprising, as ECMC targets universal quantum circuits. The results highlight the effectiveness of QuaSARQ in achieving unparalleled speedups for smaller circuits with shallow depth, consistent with our complexity analysis in section 3.4.
Table 1 reports statistics for a selection of ten test cases. Evidently, the initial and scheduling times are negligible compared to the simulation time. Regarding tableau allocation, QuaSARQ, with its tableau slicing mechanism, occupies at most 22GB of memory regardless of the number of tableaux or qubits being simulated. For EC, two tableaux are required (one per circuit); refer to the tableau memory column in Table 1 for precise numbers. Even when QuaSARQ supports measurements, 24GB is sufficient for the additionally required tableau, the destabilizer tableau, albeit with more blocks. The destabilizer tableau keeps track of the anti-commutators necessary for simulating measurements [1]. In contrast, Stim in CCEC and ECMC consumed up to 256GB of memory before eventually running out. This underscores QuaSARQ ’s ability to reduce memory usage by up to 90%.
Table 1.
Statistics for a selection of ten circuits.

5 Conclusions and Future Work

We presented a parallel algorithm for equivalence checking of stabilizer circuits on GPUs and developed a new tool called QuaSARQ. Experimental results show that QuaSARQ significantly outperforms the current state of the art, with a 186\(\times \) speedup compared to CCEC, which leverages Stim—a simulator that utilizes the SSE2/AVX2 vectorized instructions. To the best of our knowledge, this is the first GPU-based equivalence checker that surpasses CCEC in terms of time, memory, and energy.
Looking ahead, we aim to enable measurements on GPUs to support the full simulation of quantum stabilizer circuits.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Footnotes
1
Refer to section 5.3 (GPU Experiment Failure) in [9, p.14].
 
2
T gates are notoriously hard to simulate [22].
 
3
Tableau with the \(\textit{Z} _M\) matrix filled with 1s in its diagonal and 0s everywhere else.
 
4
Tableau with the \(\textit{X} _M\) matrix filled with 1s in its diagonal and 0s everywhere else.
 
7
We attempted to use gpusat2, but it failed on all circuits due to memory overflow.
 
References
3. Bal, H., Epema, D., de Laat, C., van Nieuwpoort, R., Romein, J., Seinstra, F., Snoek, C., Wijshoff, H.: A Medium-Scale Distributed System for Computer Science Research: Infrastructure for the Long Term. IEEE Computer 49(5), 54–63 (2016). https://doi.org/10.1109/MC.2016.127

4. Berent, L., Burgholzer, L., Wille, R.: Towards a SAT Encoding for Quantum Circuits: A Journey From Classical Circuits to Clifford Circuits and Beyond. In: Meel, K.S., Strichman, O. (eds.) SAT 2022. LIPIcs, vol. 236, pp. 18:1–18:17. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl, Germany (2022). https://doi.org/10.4230/LIPIcs.SAT.2022.18

19. Marques, J.F., Varbanov, B.M., Moreira, M.S., Ali, H., Muthusubramanian, N., Zachariadis, C., Battistel, F., Beekman, M., Haider, N., Vlothuizen, W., Bruno, A., Terhal, B.M., DiCarlo, L.: Logical-qubit operations in an error-detecting surface code. Nature Physics 18(1), 80–86 (Dec 2021). https://doi.org/10.1038/s41567-021-01423-9

21.

22. Nielsen, M.A., Chuang, I.L.: Quantum Computation and Quantum Information: 10th Anniversary Edition. Cambridge University Press (2010)

26. Sistla, M., Chaudhuri, S., Reps, T.: Symbolic quantum simulation with Quasimodo. In: Enea, C., Lal, A. (eds.) Computer Aided Verification. pp. 213–225. Springer Nature Switzerland, Cham (2023)

27. Sutcliffe, M., Kissinger, A.: Fast classical simulation of quantum circuits via parametric rewriting in the ZX-calculus (2024)

29.