Paper The following article is Open access

Greedy algorithm based circuit optimization for near-term quantum simulation

, , , , , , and

Published 30 June 2022 © 2022 The Author(s). Published by IOP Publishing Ltd
, , Citation Yi Hu et al 2022 Quantum Sci. Technol. 7 045001 DOI 10.1088/2058-9565/ac796b

2058-9565/7/4/045001

Abstract

Simulating quantum systems is believed to be one of the most important applications of quantum computers. On noisy intermediate-scale quantum (NISQ) devices, the high-level circuit designed by quantum algorithms for Hamiltonian simulation needs to consider hardware limitations such as gate errors and circuit depth before it can be efficiently executed. In this work, we develop a hardware-agnostic circuit optimization algorithm to reduce the overall circuit cost for Hamiltonian simulation problems. Our method employ a novel sub-circuit synthesis in intermediate representation and propose a greedy ordering scheme for gate cancellation to minimize the gate count and circuit depth. To quantify the benefits of this approach, we benchmark proposed algorithm on different Hamiltonian models. Compared with state-of-the-art generic quantum compilers and specific quantum simulation compiler, the benchmarking results of our algorithm show an average reduction in circuit depth by 16.5× (up to 64.1×) and in gate count by 7.8× (up to 23.7×). This significant improvement helps enhance the performance of Hamiltonian simulation in the NISQ era.

Export citation and abstract BibTeX RIS

Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

While building a fault-tolerant quantum computer is still infeasible, recent progress in quantum hardware reveals a significant outperformance of a quantum processor on certain computational tasks which would take thousands of years on a state-of-the-art classical supercomputer [1]. Quantum computers are reaching the stage where specific problems can be achieved within desirable accuracy [2]. We are entering into the noisy intermediate-scale quantum (NISQ) era, which is characterized by quantum computers with noisy qubits range from 50 to 100 and lacking full scale quantum error correction [3]. In order to make the best use of current quantum hardware, several restrictions must be taken into account: (1) limited numbers of qubits, (2) connective constraints of qubits, (3) two-qubit gate errors and limited circuit depth resulted from coherent and incoherent errors. So the most important question is how to find useful applications of NISQ-era hardware. For reaching this goal, quantum algorithms that can be executed on current quantum devices are developed to deal with practical problems [410].

Quantum simulation is possibly to be one of the first practical applications of quantum computing, it has broad applications in quantum many-body physics [11] and quantum chemistry [1214]. Simulating the dynamics of quantum systems is extremely hard for classical computers as a result of the exponential growth of the system size. Quantum computers promise to efficiently solve this problem, however, the gate cost of simulating quantum systems is still prohibitive [15]. The main challenge for quantum simulation is to construct an efficient circuit that closely approximates the time evolution of the Hamiltonian of a quantum system.

Different approaches have been proposed for efficient quantum simulation, including product formulas [11, 1618], linear-combination-of-unitaries [19, 20], truncated Taylor series [21, 22], quantum walk [23], quantum signal processing [24], and multi-product formulas [25, 26]. Product formulas are utilized, as is most common, when the Hamiltonian can be decomposed as a sum of separate terms (H = ∑j Hj ), such that the time evolution of each Hj is easily implemented. Then the time evolution of the system can be described by $U={\mathrm{e}}^{-\mathrm{i}t{\sum }_{j}{H}_{j}}$. A standard approach in product formulas is Trotter–Suzuki decomposition

Equation (1)

this Trotter–Suzuki formula is referred to as first-order approximation. Each individual time evolution operator ${\text{e}}^{-\text{i}(t/k){H}_{j}}$ is made up of efficiently implementable quantum gates, which can be run on a quantum computer. The approximation errors, which scale as $\mathcal{O}\left({t}^{2}/k\right)$, arise from noncommutative terms in the Hamiltonian. If all the terms in the Hamiltonian commute, the exponent of sum of all terms is equal to products of the separate exponents (i.e., ${\mathrm{e}}^{-\mathrm{i}t{\sum }_{j}{H}_{j}}={\prod }_{j}{\mathrm{e}}^{-\mathrm{i}t{H}_{j}}$). If some terms do not commute, which naturally exists in physical systems, ${\left({\prod }_{j}\,{\text{e}}^{-\text{i}(t/k){H}_{j}}\right)}^{k}$ asymptotically approximate e−itH for large k. ${\prod }_{j}\,{\text{e}}^{-\text{i}(t/k){H}_{j}}$ is called one Trotter step, and the circuit will repeat k times.

To make Trotter error arbitrarily small, one can increase the repetition number. However, this also increases the depth of the circuit which is limited on NISQ devices. For saving the simulation cost without losing accuracy, one can reduce k by using high-order approximation. For instance, the second-order approximation

Equation (2)

can reduce k from $O\left({(tL{\Lambda})}^{2}/{\epsilon}\right)$ (first-order) to $O\left({(tL{\Lambda})}^{1+\frac{1}{2}}/{{\epsilon}}^{\frac{1}{2}}\right)$ [27, 28], where Λ is the magnitude of the strongest term and epsilon is the desired error threshold. An alternative way of decreasing cost is to find methods of reducing depth and gate count in one Trotter step, which is of most concern in this paper.

Here, we propose several techniques to address this problem. First of all, we develop a novel sub-circuit synthesis based on [29] to decompose k-local Hamiltonian in Pauli intermediate representation (IR). Then follows the technique which exploits the advantages of reordering Trotter sequences. By rearranging the individual terms in Trotter formula, we search for a superior ordering scheme that minimizes the circuit depth and the gate count. The method for reordering is based on greedy algorithm and provides great benefits in gate cancellation which contributes to the reduction of circuit depth. Further, we parallelize the operations, leaving the removed gate count unchanged but additionally reducing the circuit depth. Moreover, by combining the strategy proposed in [30] for reducing CNOT gates with our ordering scheme, we further decrease the amount of CNOT gates. The techniques we developed are applicable to quantum simulation kernels (i.e., the circuit that implement the operator of equation (1)), which appear in a wide range of algorithms [1114], and are independent of the underlying hardware.

This paper is organized as follows. In section 2 we introduce proposed sub-circuit synthesis in Pauli IR. In section 3 we explains how the error of Trotter–Suzuki approximation and the gate cancellation procedure depend on the Trotter ordering scheme, both of which influence the depth of overall circuit. Section 4 illustrates the proposed method for order search based on greedy algorithm. We introduce further optimizations of the circuit in section 5. Evaluation and benchmarking are exhibited in section 6. After discussion we conclude in section 7.

2. Sub-circuit synthesis in Pauli IR

For efficiently executing a quantum simulation circuit on NISQ devices, the Hamiltonian evolution operator in equation (1) need to be decomposed into executable single- and two-qubit operations, and compilers carry out this routine for every local Hamiltonians. The state-of-the-art generic compilers (e.g., Qiskit [31], t|ket⟩ [32]) will convert the multi-qubit operator into a sequence of their elementary gates such as single-qubit rotations and CNOT gates. Due to the lack of high-level IR of Pauli strings, they fail to leverage the optimization opportunities of synthesizing Pauli IR. Some compiler-level IR optimization frameworks such as Paulihedral [33] are designed to settle this problem, and quality improvement can be seen. Therefore, we capitalize on the intermediate optimization, proposing a novel sub-circuit synthesis in Pauli IR to better optimize quantum simulation kernels.

First, we introduce some analytic identities found by Clinton et al [29]. According to supplementary lemmas 7 and 8 in [29], for a Hamiltonian $H=\frac{1}{2\mathrm{i}}\left[{h}_{1},{h}_{2}\right]$ where h1 and h2 anti-commute and both square to identity, the evolution operator U(t) = eitH can be decomposed in two ways:

Equation (3)

or

Equation (4)

where t1, t2, ϕ are pulse times given by evolution time t. Note that equation (3) satisfies under conditions but equation (4) does not. Followed by the demonstration in [29], we can decompose an evolution operator eitH recursively using equation (4) decomposition then end with equation (3) decomposition, we apply this principle in our synthesis techniques.

Before introducing our intermediate circuit synthesis, let us define some notations that are used in this paper. The Hamiltonians in quantum simulation kernel we need to synthesize are expressed in the form of tensor product of Pauli matrices H = σ0σ1 ⊗ ⋯ ⊗ σn−1, where σi ∈ {I, X, Y, Z} describes the Pauli operator acting on the ith qubit, and n is number of qubits. For simplicity, in the rest of the paper we omit the identity operators and denote a time-evolution operator of the Hamiltonian by a Pauli string. For example, we denote the evolution operator $\mathrm{exp}\left(\mathrm{i}t{X}_{0}\otimes {Y}_{1}\otimes {Y}_{2}\otimes {I}_{3}\otimes {Z}_{4}\right)$ as X0 Y1 Y2 Z4 which corresponding to a four-local Hamiltonian X0Y1Y2Z4.

Following the decomposition schemes we described above, our intermediate circuit synthesis works as described below:

  • (a)  
    For a k-local Hamiltonian H, we decompose eitH using equation (4) while choosing h1 as a two-local Hamiltonian and h2 as a (k − 1)-local Hamiltonian then implement the same decomposition procedure for (k − 1)-local Hamiltonian iteratively until k − 1 = 3. For the rest three-local Hamiltonians, we apply equation (3) decomposition picking both h1 and h2 as two-local Hamiltonians so that all the decomposed operators are two-local interactions.
  • (b)  
    For each Hamiltonian need to be synthesized (k ⩾ 3), h1, h2 are computed by Pauli algebra.

We synthesize X0 Y1 Y2 Z4 as an example. In the first iteration, we rewrite X0Y1Y2Z4 as ${X}_{0}\otimes \left(\frac{1}{2\mathrm{i}}\left[{Z}_{1},{X}_{1}\right]\right)\otimes {Y}_{2}\otimes {Z}_{4}$ such that we can choose h1 = X0Z1, h2 = X1Y2Z4, then utilize the equation (4) decomposition.

In the second iteration, the procedure is repeated for h2 = X1Y2Z4, and we select ${h}_{1}^{\prime }={X}_{1}\otimes {Z}_{2},{h}_{2}^{\prime }={X}_{2}\otimes {Z}_{4}$. As a result, a k-local operator is decomposed into two-local operators. The circuit that explains this decomposition flow is depicted in figures 1 and 2.

Figure 1.

Figure 1. Intermediate circuit synthesis for X0 Y1 Y2 Z4 operator, the X1 Y2 Z4 operators are further decomposed in figure 2.

Standard image High-resolution image
Figure 2.

Figure 2. Intermediate circuit synthesis for X1 Y2 Z4 operator.

Standard image High-resolution image

After the proposed intermediate circuit synthesis, quantum simulation kernel is converted into two-local interaction sequences. The advantage of this synthesis is that it provides an opportunity for optimizing circuit by flexibly adjust the order of these two-local operators. The techniques will be described in the following sections.

3. Impact of Trotter terms ordering

The choice of ordering scheme has formerly been demonstrated to hold a great impact on the Trotterization [1214, 17, 18]. Specific tasks determine the utilization of the Trotter ordering scheme, which affects the performance of Trotterization in terms of the circuit length. In the context of quantum chemistry, a magnitude ordering is physically meaningful, due to the fact that terms with higher magnitude are more likely to correspond to stronger interactions. Utilizing this ordering scheme can significantly reduce the error in one Trotter step but is not optimal in gate count [12, 17, 18]. Alternatively, a lexicographic ordering is applied for saving as much gate cost as possible through cancellation, as it maximizes the similarity of adjacent terms [1214, 17, 18].

Choosing incompatible Trotter ordering schemes will differ in the length of the circuit. Minimizing the Trotter error may potentially acquire an order which is not optimal in terms of the circuit depth, where a lexicographic ordering holds a great advantage. Since we aim at optimizing the circuit of an arbitrary k-local Hamiltonian simulation, where the minimization of Trotter error which is useful to specific molecular Hamiltonians could possibly be outweighed by the effect of gate cancellation, we devote to addressing the latter one.

As the number of all possible orderings grows factorially with the amount of individual Trotter terms, it is difficult to find an optimal ordering scheme by performing an exhaustive search. However, for gate cancellation procedures, it is possible to find a superior order that minimizes the gate count as well as the circuit length, when the Hamiltonian is specified. Several approaches have been developed to this end [1214, 17, 18], but they are simply based on the lexicographic ordering. Combining the greedy algorithm with the basic lexicographic ordering, we establish a method which significantly outperforms the previous ones.

After the sub-circuit synthesis we have discussed above, quantum simulation kernel is divided into two-local interaction sequences. So we first introduce the decomposition of the evolution operators for two-local Hamiltonians, which helps illustrate our techniques. A two-qubit ZZ interaction operator can be decomposed as

Equation (5)

where j, k indicate the qubits on which the operator act. To make it more explicit, we give the gate level implementation as shown in figure 3.

Figure 3.

Figure 3. Circuit of a two-local ZZ interaction operator.

Standard image High-resolution image

This is one of the commonly used decompositions, which is also our choice, and we term the whole circuit in figure 3 as a two-local ZZ interaction gate.

Here, we state some preliminary knowledge. Considering a Hermitian operator M with its spectral decomposition $M=S{\Lambda}{S}^{{\dagger}}={\sum }_{j}\,{\lambda }_{j}\left\vert {m}_{j}\right\rangle \left\langle {m}_{j}\right\vert $, it holds that exponentiation of the operator M is equivalent to the sum of exponentiation of its eigenvalues,

Equation (6)

We focus on the diagonalization matrices D = S (i.e., DMD = Λ). Pauli Z operator is already diagonal, which means Dz = I. The other two Pauli operators X and Y can be diagonalized to Λ = Z with diagonalization matrices Dx and Dy separately. And it follows that

Equation (7)

Equation (8)

A general n-Pauli operator can be exponentiated in the same way by conducting tensor product of the diagonalization matrices corresponding to each of the terms, i.e.

Equation (9)

where P ∈ {X, Y, Z}. It implies that we can decompose any k-local interaction operators by this formula.

In the case of two-local interaction, for example, the Pauli XX and YY interaction operators can be decomposed to gate level as shown below in figures 4 and 5 (and likewise for other two-local interaction operators).

Figure 4.

Figure 4. Circuit of a two-local XX interaction operator.

Standard image High-resolution image
Figure 5.

Figure 5. Circuit of a two-local YY interaction operator.

Standard image High-resolution image

Here we choose diagonalization operators Dx = H and Dy = Rx (π/2) (where H is Hadamard transform operator and Rx (θ) is single-qubit rotation operator about the $\hat{x}$ axis, both of which are also corresponding quantum gates), but note it is not the only choice. The benefits of this form of decomposition are manifest, since the adjacent Dx and ${D}_{x}^{{\dagger}}$ (or Dy and ${D}_{y}^{{\dagger}}$) in the circuit will be canceled out immediately. It should be noted that all the diagonalization gates of a two-local interaction are possibly to be canceled out, which results in a significant reduction of the gate count. An example is shown in figure 7.

After rearranging the two-local gate sequences, the 10 diagonalization gates in the dashed box of figure 7 can all be canceled out, which will reduce the depth of circuit by 4. While with the ordering of original circuit in figure 6, the circuit depth can be reduced only by 2. As we can see in figure 7, the deposition of a two-local gate determines the cancellation on both the left side and the right side of it, and will affect the rest of the circuit. The problem of finding a superior order which maximizes the gate cancellation of the overall circuit is tricky. In the next section, we will illustrate our strategies to address this problem.

Figure 6.

Figure 6. Original circuit.

Standard image High-resolution image
Figure 7.

Figure 7. Circuit after reordering.

Standard image High-resolution image

4. Order search strategies

Since exhaustively searching all possible orders is impractical, alternatives need to be found. We cope with this problem by applying a greedy algorithm based heuristic search method. In our scenario, the reduction of gate count is the target of the optimization problem. Previous works have shown that a lexicographic ordering is beneficial for the purpose of gate cancellation. This ordering scheme intends to derive a maximum similarity of adjacent Pauli strings. But it is basically a numerical ordering either with respect to the fermionic operators or with respect to the individual Pauli operators [17, 18]. As we have shown above, the placement of a two-local gate influences the cancellation on both the left side and the right side of it. Merely sorting the Trotter sequences lexicographically may probably not give rise to the maximum amount of gate cancellation. Here we propose our ordering strategies and explain how it comes from greedy algorithm.

4.1. Trotter layers partition procedure

In the first stage of the algorithm, we partition the circuit into layers such that the reduction of gate count is locally optimal in each layer. That's where the greedy algorithm comes in. For reaching this goal, we develop several strategies.

First, we group two-local gates into different local Pauli-index pools in terms of the similarity of Pauli strings. Here we use a sequence X0 Y2 to denote a two-local gate ${\text{e}}^{\text{i}\theta {X}_{0}{Y}_{2}}$ (qubit indices start at 0). A sequence X0 X3 has the same Pauli string with X0 Y2 on qubit 0, so they are both grouped into the ${}_{{X}_{0}}$ index pool. X0 Y2 and Y2 Z3 will both be grouped into the ${\\}_{{Y}_{2}}$ index pool consequently. Note that a two-local gate is labeled by two Pauli-index pools.

Next, we pick out the gates from Pauli-index pools one by one for each layer, following a strategy that adjacent gates in one layer are chosen from one Pauli-index pool where the second index of last gate and the first index of next gate in. Subsequently, all the two-local gates in a single layer are arrayed in a stepped arrangement from smallest qubit number to the largest, so that the operations in this layer can be easily parallelized for further optimization. To make it concrete, we demonstrate a layout of one layer before parallelization in figure 8, where we chose sequences $\left[{Z}_{0}{X}_{1},{X}_{1}{Y}_{2},{Y}_{2}{X}_{3}\right]$ for this layer.

Figure 8.

Figure 8. The layout of one layer before parallelization, where the order of Trotter terms now is Z0 X1X1 Y2Y2 X3.

Standard image High-resolution image

These strategies we proposed are determined by the following two considerations. First, the diagonalization gates on the one side of a two-local gate in the intermediate qubits will be canceled out immediately (see dashed boxes in figure 8). Second, we can parallelize two operations in every three adjacent operations without changing the amount of removed gates (where operations are shown in figure 9 as numbered blocks), which maximizes the decrease of circuit depth.

Figure 9.

Figure 9. Parallelized circuit of one layer, where the order of Trotter terms in this layer is Z0 X1Y2 X3X1 Y2.

Standard image High-resolution image

In addition, we give priority to choosing the gates from Pauli-index pools where there are other gates remained in the corresponding pools can be chosen in the next layer, which means that the diagonalization gates of the two-local gates we chose can be canceled out with those in the next layer. And further, we select the gates that act on as much qubits as possible in a layer (i.e., $\left[{X}_{0}{X}_{1},{X}_{1}{Y}_{2},{Y}_{2}{Z}_{3},{Z}_{3}{X}_{4}\right]$ instead of $\left[{X}_{0}{X}_{1},{X}_{1}{Y}_{2},{Y}_{2}{Y}_{4}\right]$), which exploits the advantage of parallelization.

4.2. Gate allocation among Trotter layers

As for the third strategy, the gates in each layer are chosen from the Pauli-index pools in last layer (except for the first layer). Additionally, for the gates act on qubit 0, which are always on the top of a layer, we can simply select a lexicographic order among layers.

Since every diagonalization gate of a two-local gate can be canceled out at most once, regardless of which one is canceled out with, we move all the gate cancellation procedures ahead quickly as the reason of the third strategy. Explicitly, if two diagonalization gates on the left side of a two-local gate can both be canceled out with those on the right side of last layer, this two-local gate would be the first choice, and the case of only one diagonalization gate can be canceled out would be the second choice. As an example, the choice of gates as shown in the dashed boxes of figure 6 is outperformed by that on the corresponding position of figure 7.

Joining the adjacent gate cancellation procedures together, all the diagonalization gates that contribute to gate reduction are exhausted. That is, these locally optimal reductions, combined together, have resulted in a globally optimal solution in terms of the amount of removed gates. Furthermore, benefiting from the proposed ordering strategies, we can parallelize possible operations among adjacent layers without decreasing the amount of gate cancellation, which additionally reduces the circuit depth. In the last step of our ordering strategies, the rest of the two-local gates whose diagonalization gates cannot be canceled out are placed at the end of circuit with parallelization. The complete order search algorithm is depicted in algorithm 1.

Algorithm 1. Order search algorithm.

Input: a list of sequences of two-local interaction gates $[{\left({P}_{j}{P}_{k}\right)}^{(n)}]$
Output: a list of ordered sequences of two-local interaction gates ${[{\left({P}_{j}{P}_{k}\right)}^{(n)}]}_{\text{ordered}}$
  1: function OrderSearch ($[{\left({P}_{j}{P}_{k}\right)}^{(n)}]$)
  2: for all ${P}_{j}{P}_{k}\in {\left({P}_{j}{P}_{k}\right)}^{(n)}$ do
  3:  IndexPools: ${\\}_{{P}_{j}},{\\}_{{P}_{k}}{\leftarrow}{P}_{j}{P}_{k}$ (⊳) Group two-local gates Pj Pk into Pauli-index pools ${\\}_{{P}_{j}},{\\}_{{P}_{k}}.$
  4: end for
  5: ${[{\left({P}_{j}{P}_{k}\right)}^{(n)}]}_{\text{ordered}}{\leftarrow}\varnothing $ (⊳) Set ${[{\left({P}_{j}{P}_{k}\right)}^{(n)}]}_{\text{ordered}}$ to empty-list
  6: nextlayer ← true
  7: {}lastlayerpools ← ∅ (⊳) Store IndexPools in last layer
  8: CoIndex ← ∅ (⊳) Store concatenation index among adjacent gates in one layer
  9: repeat (⊳) Pick out two-local gates one by one for each layer to conduct gate cancellation procedure
 10: if nextlayer = true then (⊳) Choose first two-local gate of a layer
 11: for Pj Pk in ${\left({P}_{j}{P}_{k}\right)}^{(n)}$ do
 12: if Pj , Pk in {}lastlayerpools then
 13: ${[{\left({P}_{j}{P}_{k}\right)}^{(n)}]}_{\text{ordered}}{\leftarrow}{P}_{j}{P}_{k}$ (⊳) Append Pj Pk to ${[{\left({P}_{j}{P}_{k}\right)}^{(n)}]}_{\text{ordered}}$
 14: $[{\left({P}_{j}{P}_{k}\right)}^{(n)}]{\leftarrow}[{\left({P}_{j}{P}_{k}\right)}^{(n)}]-{P}_{j}{P}_{k}$, ${\\}_{\text{lastlayerpools}}{\leftarrow}{\\}_{{P}_{j}},{\\}_{{P}_{k}}$, CoIndex ← Pk , nextlayer ← False, break (⊳)Remove
Pj Pk from $[{\left({P}_{j}{P}_{k}\right)}^{(n)}]$, update last layer IndexPools with new ${\\}_{{P}_{j}},{\\}_{{P}_{k}}$, find CoIndex and jump out of FOR loop
 15: else if j = 0 then
 16: ${[{\left({P}_{j}{P}_{k}\right)}^{(n)}]}_{\text{ordered}}{\leftarrow}{P}_{0}{P}_{k}$
 17: $[{\left({P}_{j}{P}_{k}\right)}^{(n)}]{\leftarrow}[{\left({P}_{j}{P}_{k}\right)}^{(n)}]-{P}_{0}{P}_{k}$, ${\\}_{\text{lastlayerpools}}{\leftarrow}{\\}_{{P}_{0}},{\\}_{{P}_{k}}$, CoIndex ← Pk , nextlayer ← false, break
 18: end if
 19: end for
 20: else
 21: repeat (⊳) Choose the rest gates of a layer
 22: for CoIndex in {}CoIndex and Pj Pk in ${\left({P}_{j}{P}_{k}\right)}^{(n)}$ do
 23: if Pj = CoIndex and Pj , Pk in {}lastlayerpools then
 24: ${[{\left({P}_{j}{P}_{k}\right)}^{(n)}]}_{\text{ordered}}{\leftarrow}{P}_{j}{P}_{k}$
 25: $[{\left({P}_{j}{P}_{k}\right)}^{(n)}]{\leftarrow}[{\left({P}_{j}{P}_{k}\right)}^{(n)}]-{P}_{j}{P}_{k}$, ${\\}_{\text{lastlayerpools}}{\leftarrow}{\\}_{{P}_{j}},{\\}_{{P}_{k}}$, CoIndex ← Pk , break
 26: end if
 27: end for
 28: until nextlayer = True
 29:  end if
 30:  Reorder existent two-local sequences in ${[{\left({P}_{j}{P}_{k}\right)}^{(n)}]}_{\text{ordered}}$ for parallelization
 31: until no Pj Pk in rest $[{\left({P}_{j}{P}_{k}\right)}^{(n)}]$ contribute to gate cancellation
 32: append the rest two-local sequences to ${[{\left({P}_{j}{P}_{k}\right)}^{(n)}]}_{\text{ordered}}$ list with parallelization
 33: return ${[{\left({P}_{j}{P}_{k}\right)}^{(n)}]}_{\text{ordered}}$
 34: end function

5. Further optimizations

As we mentioned previously, error rates of the two-qubit gates on NISQ devices are non-negligible, which makes it vital to minimize the cost of two-qubit gates. In this section we discuss the second stage of our algorithm, which further optimizes the circuit by integrating the approach proposed in [30] into our gate reduction framework, for decreasing the amount of CNOT gates. To begin with, a briefly review of the approach we will employ is necessary.

In [30], the authors developed two hardware independent approaches for reducing the count of CNOT gates in the circuit. The first one is based on edge coloring (EC) method where parallelization of circuit takes priority, while the second one is based on depth first search (DFS) which holds an advantage in the amount of gate reduction. Both of the two approaches derived from the theorem 1 that they found as described in paper [30]. On account of the decomposition form we use, our two-qubit interaction operators match with the form of operators which can be optimized under the condition of theorem 1. That is to say, these two approaches can be grafted onto our optimization framework after the ordering and gate cancellation routine. Here we elucidate more details of implementation.

According to theorem 1 in [30], an operator V1 of the form

Equation (10)

can be replaced by

Equation (11)

under certain conditions, where the operator V1 is exactly our two-qubit ZZ interaction operator in equation (5) if we substitute γl with δ. (In our case, δ describes a small time-step of Trotterization.) Following the techniques they proposed, the first CNOT gate of corresponding operators in the first layer of our circuit can be eliminated. We demonstrate this with examples based on the DFS method and EC method respectively, as shown in figures 10 and 11.

Figure 10.

Figure 10. Optimized first layer of circuit after gate cancellation procedure and DFS CNOT reduction procedure, we used the same example as in figure 8 (i.e., $\left[{Z}_{0}{X}_{1},{X}_{1}{Y}_{2},{Y}_{2}{X}_{3}\right]$ for this layer).

Standard image High-resolution image
Figure 11.

Figure 11. Optimized first layer of circuit after gate cancellation procedure and EC CNOT reduction procedure, used the same example with figure 10.

Standard image High-resolution image

Utilizing the DFS based method for CNOT gates reduction, we can diminish the amount of CNOT gates up to N − 1, where N is the amount of qubits. In the case of EC based method, the gate count can be reduced up to N/2, but with lower circuit depth when the amount of qubits is large. The overall circuit optimization algorithm is summarized in algorithm 2.

Algorithm 2. Circuit optimization algorithm.

Input: original circuit of an arbitrary two-local Hamiltonian simulation
Output: optimized circuit of an arbitrary two-local Hamiltonian simulation
 1:if DFS path then
 2: execute order search routine without parallelizing operations in the first layer
 3: conduct gate cancellation procedure
 4: use DFS based CNOT gates reduction subroutine
 5:else if edge coloring path then
 6: execute the complete order search routine
 7: conduct gate cancellation procedure
 8: use edge coloring based CNOT gates reduction subroutine
 9: end if
10: output the optimized circuit

Since the cost of total gate count and circuit depth are both crucial indicators of the performance of an optimization algorithm on NISQ devices, we evaluate both of the two optimization pathways with various circuit models. The synthetic performances of them will be shown in the next section.

6. Evaluation

In this section, we exhibit results of some representative benchmarks for constructing baselines by which to compare the performance of our optimization algorithm with other state-of-the-art optimizers.

6.1. Experiment setup

Benchmarks: to cover as many varieties of applications as possible, we select different kinds of Hamiltonians as our benchmarks, including seven molecule Hamiltonians generated by PySCF [34] (HF, LiH, H2O, NH2, CH2, NH3, CH4) and VQE UCCSD ansatzes [7] of three sizes. We also prepare the Fermi–Hubbard models of four square lattice sizes, which are well known in condensed matter physics to describe the interactions between particles in a lattice. And finally we choose random Hamiltonians with the number of qubits range from 4 to 12 for a more comprehensive assessment. For the randomly generated Hamiltonians, we set the amount of Pauli strings increase polynomially with respect to the size of the qubits.

Metrics: we use the total gate count and depth of circuit as metrics to evaluate the performance of different quantum circuit optimizers. Due to the diverse circuit synthesis, different gate sets for representing a quantum circuit may give rise to different gate count and circuit depth. To establish a fair comparison, we compute the total gate count by adding up all the single-qubit gates and CNOT gates after circuit optimization, as these gates can be executed on NISQ devices directly. And the same for circuit depth.

Comparisons: we compare our circuit optimization algorithm with two generic state-of-the-art quantum compilers Qiskit [31] and t|ket⟩ [32], and a specific quantum simulation compiler Paulihedral [33]. Qiskit compile quantum simulation kernels into Ising type gates in its elementary gate library then apply the level 3 optimization, t|ket⟩ compiler has particular techniques based on simultaneous diagonalization for quantum simulation subroutine then follows 'full-pass' optimization. Paulihedral performs the block-wise optimization for quantum simulation kernels, it employs a specific Pauli IR that can preserve high-level semantics then conduct its block scheduling passes for gate cancellation utilizing lexicographic ordering. We compare both DFS and EC optimization paths with the above compilers.

6.2. Comparison result

Figure 12 shows the results of evaluation, note that due to the huge gap between DFS (EC) optimization and other compilers, the results of DFS and EC are difficult to read, it also reveals the great advantages of our optimization algorithm. Comparing with Qiskit, DFS (EC) optimization can reduce total gate count by 7.5× (7.5×), circuit depth by 15.4× (15.2×) on average. The reduction in total gate count and circuit depth when comparing with t|ket⟩ are 8.5× (8.5×) and 19.8× (19.5×) on average, with respect to DFS (EC) optimization. The reason of low efficiency in t|ket⟩ is probably the high overhead induced by simultaneous diagonalization technique. As we can see, generic compilers perform poorly in quantum simulation kernel optimization tasks when compared with our algorithm.

Figure 12.

Figure 12. Evaluation results of molecule Hamiltonians, UCCSD ansatzes, Fermi–Hubbard models, and random Hamiltonians. DFS and EC denote the depth first search and edge coloring optimization paths, which significantly outperform Paulihedral, Qiskit, and t|ket⟩ compilers.

Standard image High-resolution image

Next we turn to the comparison with Paulihedral, a specific quantum simulation compiler. The results in figure 12 exhibits that DFS (EC) optimization can still reduce total gate count by 7.5× (7.5×), and circuit depth by 14.3× (14×) on average, which means that our techniques for synthesizing quantum simulation kernels combined with greedy ordering scheme for gate cancellation greatly outperforms the lexicographic ordering scheme. The complete benchmarking results are presented in table 1.

Table 1. The complete benchmarking results of DFS, EC, Paulihedral, qiskit, and t|ket⟩.

 DFSECPaulihedralqiskittket
TotalDepthTotalDepthTotalDepthTotalDepthTotalDepth
HF10703381081362488429664810305958884525
LiH10703381081362488429664810305958884525
H2O15845121593511923452539466584495676712
NH2 1704560167056514 532837215 237927417 52913 166
CH2 1704560167056514 871857615 631954018 32013 774
NH3 2365743237478722 17912 51622 63913 39624 81418 255
CH4 2714852270685325 98214 63426 05815 75527 12419 113
UCCSD-828714529615111717611259874917761
UCCSD-1289626989529112 320780011 440750311 8119803
UCCSD-161622511162051132 53020 47033 53021 89838 46032 772
Fermi–Hubbard 2 × 2139591515816682188107182109
Fermi–Hubbard 3 × 3489118502128890421870453933561
Fermi–Hubbard 4 × 4941142924146183565617646991758845
Fermi–Hubbard 5 × 51491232150520429969683075120130561597
Random-4119781187916811014089174129
Random-5282149284150745504711493953654
Random-6420182426186169711671665118520981527
Random-7619223612225315221703123227339792712
Random-8826264842271553736685527397271734630
Random-91023329988312891357568870644311 3637394
Random-101174371113836314 139899714 06610 17617 45611 294
Random-111360431135143720 44512 88420 64214 91525 64616 272
Random-121589524156749829 71918 13829 70521 55036 57322 957

For the effect of DFS and EC optimization paths, it can be seen that in most cases their performances are similar, while in the situation of random generated Hamiltonians, EC optimization method can yields lower circuit depth when the size of circuits is large. It suggest that for a generic quantum simulation kernel with large interaction size, EC optimization method can be more practical.

In summary, our optimization algorithm outperforms Qiskit, t|ket⟩, and Paulihedral compilers in both total gate count and circuit depth, especially for the capability of depth reduction. For all benchmarks, DFS optimization algorithm can reduce the total gate count by 7.8× (up to 23.7×) and circuit depth by 16.5× (up to 64.1×) on average when comparing to other compilers. And EC optimization algorithm shows an average reduction in gate count by 7.8× (up to 23.7×) and in circuit depth by 16.3× (up to 64.1×) for all.

7. Conclusion

In this work, we developed a circuit optimization algorithm for Hamiltonian simulation problems. Distinct from the conventional circuit synthesis, we employ a novel sub-circuit synthesis in Pauli IR and combined with a greedy ordering scheme which explores the search space for an superior Trotter ordering scheme. By searching locally optimal Trotter orders and joining them together, we obtain an advantageous overall order which significantly decreases the gate count and circuit depth from gate cancellation. Moreover, we grafted the approach proposed in [29] onto our gate reduction framework for further optimization, which results in two optimization pathways, namely EC based optimization and DFS based optimization. Evaluation and benchmarking results showed that our algorithm significantly outperforms both state-of-the-art generic quantum compilers t|ket⟩ and Qiskit and specific quantum simulation compiler Paulihedral in terms of total gate count and circuit depth. For a large variety of quantum simulation kernels, DFS and EC optimization algorithm can substantially reduce the overall circuit cost. We believe our techniques will help enhance the performance of quantum simulation on NISQ devices and allow them to solve more practical problems.

Acknowledgments

This work was supported by the National Science Foundation of China (Nos. 61871111 and 61960206005) and the Fundamental Research Funds for the Central Universities (Nos. 2242022k30006 and 2242022k30001).

Data availability statement

The data that support the findings of this study are available upon reasonable request from the authors.

Please wait… references are loading.
10.1088/2058-9565/ac796b