Published in: Complex & Intelligent Systems 5/2023

Open Access 24.02.2023 | Original Article

BPLC + NOSO: backpropagation of errors based on latency code with neurons that only spike once at most

Authors: Seong Min Jin, Dohun Kim, Dong Hyung Yoo, Jason Eshraghian, Doo Seok Jeong


Abstract

For mathematical completeness, we propose an error-backpropagation algorithm based on latency code (BPLC) with spiking neurons conforming to the spike–response model but allowed to spike once at most (NOSOs). BPLC is based on gradients derived without approximation unlike previous temporal code-based error-backpropagation algorithms. The latency code uses the spiking latency (period from the first input spike to spiking) as a measure of neuronal activity. To support the latency code, we introduce a minimum-latency pooling layer that passes the spike of the minimum latency only for a given patch. We also introduce a symmetric dual threshold for spiking (i) to avoid the dead neuron issue and (ii) to confine a potential distribution to the range between the symmetric thresholds. Given that the number of spikes (rather than timesteps) is the major cause of inference delay for digital neuromorphic hardware, NOSONets trained using BPLC likely reduce inference delay significantly. To identify the feasibility of BPLC + NOSO, we trained CNN-based NOSONets on Fashion-MNIST and CIFAR-10. The classification accuracy on CIFAR-10 exceeds the state-of-the-art result from an SNN of the same depth and width by approximately 2%. Additionally, the number of spikes for inference is significantly reduced (by approximately one order of magnitude), highlighting a significant reduction in inference delay.
Notes
Seong Min Jin and Dohun Kim have contributed equally to this work.


Introduction

Spiking neural networks (SNNs) of layer-wise feedforward structure can process and convey data forward based on asynchronous spiking events without forward locking, unlike feedforward deep neural networks (DNNs) [10, 32]. When implemented in asynchronous neuromorphic hardware, SNNs are expected to leverage this event-driven operation for efficient processing. Nevertheless, asynchronous neuromorphic hardware often suffers from traffic congestion due to the large number of spikes (events) that are routed to their destination neurons through a network-on-chip with limited bandwidth [9]. In this regard, the number of synaptic operations per second (SynOPS) is considered a crucial measure of neuromorphic hardware performance, and attempts have been made to improve the synaptic operation speed to further accelerate inference [8, 12, 27, 28]. Algorithmic approaches to faster inference include the development of learning algorithms that support inference using fewer spikes.
Given the limited accessibility to global data in multicore neuromorphic hardware, learning algorithms with locality are favored as on-chip learning algorithms. However, such local learning algorithms, e.g., the naive Hebb rule [15], spike timing-dependent plasticity [4], and the Ca-signaling model [21], fail to achieve high performance. Currently, the trend appears to be moving toward off-chip learning, allowing the learner to access large global data within the general framework of error-backpropagation (backprop). The advantage is that the rich set of optimization techniques developed for DNNs can readily be applied to SNNs, which significantly improves SNN performance [10]. Nevertheless, a notable inconsistency between DNNs and SNNs lies in the fact that output spikes, unlike activation functions, are non-differentiable.
As a workaround, the gradients of spikes are often approximated by heuristic functions, popularly referred to as surrogate gradients [2, 11, 34, 38, 44]. With surrogate gradients, gradient values are available regardless of the presence of spikes, avoiding the dead neuron issue that hinders the network from learning. To date, various surrogate gradients have been proposed, e.g., the boxcar function [38], arctan function [11], and exponential function [34]; these methods remove the inconsistency between DNNs and SNNs, yielding state-of-the-art classification accuracy on various datasets. Despite this technical success, such heuristic surrogate gradient methods lack theoretical completeness given the absence of theoretical foundations for surrogate gradients.
Spike timing-based backprop (temporal backprop) algorithms can avoid such surrogate gradients because the spike timing may be differentiated with respect to the membrane potential using a linear approximation of the near-threshold potential evolution [5]. Temporal backprop is generally prone to learning failure because of limited error-backpropagation paths: unlike surrogate gradients, spike timing gradients are available only for the neurons that spike at a given timestep. The number of error-backpropagation paths is further limited by dead neurons, i.e., neurons whose current fan-in weights are so low that they no longer fire spikes. STDBP, a temporal backprop algorithm, uses a rectified linear potential kernel to avoid the dead neuron issue [46]. The rectified linear kernel causes a monotonic increase in potential upon receiving an input spike with a positive weight, suggesting that the neurons eventually fire spikes. TSSL-BP considers additional error-backpropagation paths via spikes from the same neuron to avoid learning failure due to limited error-backpropagation paths [48]. The timing gradient is calculated using the linear approximation by Bohte et al. [5]. Another temporal backprop algorithm (STiDi-BP) uses a piece-wise linear kernel to approximate the spike timing gradient with a simple function and thus reduce the computational cost [25, 26].
Because spike timing gradients are available only for the neurons that spike, the larger the number of spikes, the richer the error-backpropagation paths in general. Thus, more spikes are desired for better training. However, this causes considerable inference delay when implemented in digital neuromorphic hardware because of its limited synaptic operation speed. Motivated by the need for
  • theoretically seamless applications of temporal backprop to SNNs,
  • workaround for the dead neuron issue,
  • fewer spikes for fast inference,
we propose a novel learning algorithm based on the spiking latency code of neurons that only spike once at most (NOSOs). NOSOs are based on the spike–response model (SRM) [13] but with an infinite hard refractory period to avoid additional spikes. The algorithm is based on the backpropagation of errors evaluated using the spiking latency code (BPLC). The key to BPLC + NOSO is that, for a spiking neuron, the spiking latency (rather than the spike itself) is the measure of its response to a given input, and this latency is differentiable without the approximations used in [5]. Thus, BPLC + NOSO is mathematically rigorous in that all required gradients are derived analytically. Other important features of BPLC + NOSO are as follows.
  • The use of NOSOs for both learning and inference minimizes the workload on the event-routing circuits in neuromorphic hardware.
  • To support the latency code, NOSONet includes minimum-latency pooling (MinPool) layers (instead of MaxPool or AvgPool) that pass the event of the minimum latency only for a given patch.
  • Each NOSO is given two symmetric thresholds (\(-\vartheta \) and \(\vartheta \)) for spiking to confine the potential distribution to the range between the symmetric thresholds.
  • BPLC + NOSO fully supports both folded and unfolded NOSONets, allowing us to use the automatic differentiation framework [31].
The primary contributions of this study include the following:
  • We introduce a novel learning algorithm based on the spiking latency code (BPLC + NOSO) with full derivations of the primary gradients without approximations.
  • We provide novel and essential methods for BPLC + NOSO support, such as MinPool layers and symmetric dual threshold for spiking, which greatly improve accuracy and inference efficiency.
  • We introduce a method to quickly calculate wallclock time for inference on general digital neuromorphic hardware, which allows a quick estimation of the inference delay for a given fully trained SNN.
The rest of the paper is organized as follows. Section “Related work” briefly overviews previous learning algorithms based on temporal codes. Section “Preliminaries” addresses the primary techniques employed in BPLC + NOSO. Section “BPLC with spike response model” is dedicated to the theoretical foundations of BPLC + NOSO. Section “Experiments” addresses the performance evaluation of BPLC + NOSO on Fashion-MNIST and CIFAR-10 and the effects of MinPool and the symmetric dual threshold for spiking on learning efficacy. Section “Discussion” discusses the estimation of inference time for an SNN mapped onto a general digital multicore neuromorphic processor. Finally, Section “Conclusion and outlook” concludes the study.
Table 1
Acronyms and symbols

IF: Integrate-and-fire
LIF: Leaky integrate-and-fire
SRM: Spike response model
BPLC: Error-backpropagation algorithm based on latency code
NOSO: Neuron that only spikes once at most
MinPool: Minimum-latency pooling
SynOPS: Synaptic operations per second
TTFS: Time to the first spike
\(\varvec{T}_\text {lat}^{(L)}\): Spiking latency of the output neurons in the output layer L
\(\hat{t}^{(l)}_{i}\): Spike timing of the ith neuron in the lth layer
\(t^{(l)}_{\text {in}, i}\): First input spike timing for the ith neuron in the lth layer
\(w_{ij}^{(l)}\): Synaptic weight from the jth neuron in the (l-1)th layer to the ith neuron in the lth layer
\(u_{i}^{(l)}\): Membrane potential of the ith neuron in the lth layer
\(v_{j}^{(l)}\): Membrane potential before weight multiplication of the jth neuron in the lth layer
\(N_\text {n}\): Number of neurons in a network
\(N_\text {c}\): Number of cores of the neuromorphic processor
\(T_{\text {up}}\): Time for multiplying the current potentials by the decay factor at each timestep
\(T_{\text {sop}}\): Time for synaptic operations at each timestep
\(T_{\text {inf}}\): Inference delay

Related work

Spike timing gradient approximation: Temporal backprop algorithms frequently use the linearly approximated spike timing gradients proposed by Bohte et al. [5]. The specific form of the gradient depends on the membrane potential kernel used. Bohte et al. [5], Comsa et al. [7], and Kim et al. [19] used an alpha kernel as an approximation of the genuine SRM kernel, and the corresponding gradients were evaluated using the linear approximation. Zhang et al. employed a rectified linear kernel to avoid the dead neuron issue [46], while Mirsadeghi et al. employed a piece-wise linear kernel for simple calculations of the gradient [25, 26]. To apply the linear approximation by Bohte et al. [5], the gradient of the membrane potential at the spike timing should be available. Integrate-and-fire (IF) neurons do not provide a gradient value at the spike timing, so Kheradpisheh and Masquelier [17] approximated the gradient as a constant of –1. The same holds for leaky integrate-and-fire (LIF) neurons. Zhang and Li [48] stated that the linear approximation was employed, but the gradient is not clearly derived.
Label-encoding as spike timings: For SNNs with temporal codes, the correct labels are frequently encoded as particular output spike timings [17, 25, 26] or as the temporal order of output spikes, such as in the time-to-first-spike (TTFS) code [30, 45, 46]. In the TTFS code, the neuron index of the first output spike indicates the output label.
Workaround for dead neurons: Comsa et al. proposed a temporal backprop with a means to avoid dead neurons (assigning penalties to the presynaptic weights of each dead neuron) [7]. Zhang et al. [46] proposed a rectified linear potential kernel that causes a monotonic increase in potential upon receiving a spike with a positive weight, so that the neuron eventually fires a spike. Zhang and Li [48] proposed TSSL-BP with additional backprop paths via the spikes emitted from the same neuron (intra-neuron dependency). These additional paths avoid the learning failure caused by backprop paths limited by dead neurons. Kim et al. [19] combined temporal backprop paths with rate-based backprop paths to compensate for the loss of temporal backprop paths due to dead neurons.
BPLC + NOSO is clearly distinguished from the previous temporal backprop algorithms in terms of the primary perspectives addressed in this section. First, BPLC + NOSO employs no approximation for gradient evaluation, unlike the previous temporal backprop algorithms, including those reviewed in this section; therefore, it involves little ambiguity. Second, the proposed spiking latency code is a novel data-encoding scheme, distinguishable from the previous temporal code schemes. Third, the symmetric dual threshold for spiking is a novel and computationally cheap method to avoid the dead neuron issue. Additionally, BPLC + NOSO is fully compatible with the original SRM without approximations.

Preliminaries

Latency code

Spiking latency is the period from the first input spike timing \(t_\text {in}\) to the consequent spike timing \(\hat{t}\), as illustrated in Fig. 1a. In the latency code, NOSONet encodes input data \(\varvec{x}\) as the spiking latency \(\varvec{T}_\text {lat}^{(L)}\) of the output neurons in the output layer L.
$$\begin{aligned} \varvec{T}_\text {lat}^{(L)} = \hat{\varvec{t}}^{(L)} - \varvec{t}_\text {in}^{(L)} = f^{(L)}(\hat{\varvec{t}}^{(L-1)};\varvec{w}^{(L-1)}), \end{aligned}$$
(1)
where \(\hat{\varvec{t}}^{(\cdot )}\) and \(\varvec{t}^{(\cdot )}_\text {in}\) denote the spike timings of the neurons in the \((\cdot )\)th layer and their first input spike timings, respectively. The function \(f^{(L)}\) encodes input spikes (from the layer L-1) at \(\hat{\varvec{t}}^{(L-1)}\) as spiking latency values \(\varvec{T}_\text {lat}^{(L)}\). The larger the weight \(\varvec{w}^{(L-1)}\), the shorter the spiking latency \(\varvec{T}_\text {lat}^{(L)}\). This latency code should be distinguished from the TTFS code [30, 45, 46] in which the first input spike timings \(\varvec{t}_\text {in}^{(L)}\) in Eq. (1) are ignored, so that it considers the output spike timings only.

Minimum-latency pooling

The MinPool layers support the latency code. Consider the time elapsed since the first input spike, \(t_\text {elap}=t-t_\text {in}\), for a given neuron. For a given 2D patch \(\mathcal {D}_\text {pool}\) at timestep t, we consider the spiking latency map \(\varvec{T}_{\text {lat,}\mathcal {D}_\text {pool}}[t]\) and the feature (spike) map \(\varvec{s}_{\mathcal {D}_\text {pool}}[t]\) in the same patch. The latency map \(\varvec{T}_{\text {lat},\mathcal {D}_\text {pool}}\) is initialized to infinite values, and each element is replaced by the real spiking latency when the corresponding neuron spikes. Note that elements once replaced by real latency values are never overwritten because NOSOs spike at most once. At timestep t, MinPool outputs one if the neuron of the smallest spiking latency in the patch fires a spike, and zero otherwise.
$$\begin{aligned}&x_\text {min} = {{\,\mathrm{arg\,min}\,}}_{x\in \mathcal {D}_\text {pool}}\left\{ \varvec{T}_{\text {lat},\mathcal {D}_\text {pool}}[t]\right\} \text {,}\nonumber \\&\text {MinPool}\left( \mathcal {D}_\text {pool}\right) \left[ t\right] = s_{x_\text {min}}\left[ t\right] , \end{aligned}$$
(2)
where \(s_{x_\text {min}}[t]\) indicates the spike function value for \(x_\text {min}\) at timestep t. An example of \(\text {MinPool}\left( \mathcal {D}_\text {pool}\right) \left[ t\right] \left( =1\right) \) is illustrated in Fig. 1b.
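To make the pooling rule concrete, the following is a minimal sketch of Eq. (2) for a single patch, assuming NumPy arrays for the latency and spike maps; the function name min_pool_patch and the array layout are ours for illustration, not part of the original implementation.

```python
import numpy as np

def min_pool_patch(latency_patch, spike_patch):
    """Minimum-latency pooling for one 2-D patch at one timestep (cf. Eq. (2)).

    latency_patch : spiking latencies of the patch (np.inf where no spike yet);
                    assumed already updated for spikes emitted at this timestep
    spike_patch   : binary map of spikes emitted at the current timestep
    Returns 1 if the minimum-latency neuron in the patch spikes now, else 0.
    """
    if np.all(np.isinf(latency_patch)):
        return 0  # no neuron in the patch has spiked yet
    x_min = np.unravel_index(np.argmin(latency_patch), latency_patch.shape)
    return int(spike_patch[x_min])

# toy 2x2 patch: the neuron at (0, 1) has the smallest latency and spikes now
latency = np.array([[np.inf, 3.0], [5.0, np.inf]])
spikes = np.array([[0, 1], [0, 0]])
print(min_pool_patch(latency, spikes))  # -> 1
```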

NOSO with dual threshold for spiking

Each NOSO is endowed with a symmetric dual threshold for spiking (\(-\vartheta \) and \(\vartheta \)), so that a spike is generated if the membrane potential u satisfies \(u\ge \vartheta \) or \(u\le -\vartheta \). Therefore, the subthreshold potential u is confined to the range between \(-\vartheta \) and \(\vartheta \). This restriction bounds the potential variance over the samples in a given batch, and the symmetry of the two thresholds tends to keep the potential mean over the samples close to zero. Additionally, the restriction is expected to avoid dead neurons, given that most dead neurons arise from potentials largely biased toward the negative side.
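A minimal sketch of the dual-threshold spiking rule, assuming a scalar membrane potential u and a spiking-availability flag sav (introduced formally in Eq. (3)); the function name and the example threshold value are illustrative only.

```python
def noso_spike(u, sav, theta=0.15):
    """Return (spike, sav) for one NOSO at one timestep.

    A spike is emitted when the potential crosses either symmetric threshold
    (u >= theta or u <= -theta) and the neuron has not spiked before
    (sav == 1). After spiking, sav is cleared so the neuron stays silent.
    """
    spike = 1 if (sav == 1 and (u >= theta or u <= -theta)) else 0
    return spike, 0 if spike else sav
```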

BPLC with spike response model

Spike response model mapped onto computational graphs

We consider the SRM, which is equivalent to the basic leaky integrate-and-fire (LIF) model with an exponentially decaying synaptic current [13]. However, our model is allowed to spike at most once in response to a single input sample by using an infinite hard refractory period in place of the refractory kernel. We choose the SRM, rather than simpler models such as Stein’s model [35], to enlarge the mutual information between spike timing and synaptic weight, which is the key to temporal coding.
In SRM, the subthreshold potential of the ith spiking neuron in the lth layer (\(u_{i}^{(l)}\)) is given by
$$\begin{aligned} u_{i}^{(l)}\left[ t\right] = \sum _{j} w_{ij}^{(l)}\left( \epsilon *s_{j}^{(l-1)} \right) \left[ t\right] sav_i^{(l)}\left[ t\right] , \end{aligned}$$
(3)
where j denotes the indices of the presynaptic neurons, and \(w_{ij}^{(l)}\) denotes the synaptic weight from the jth neuron in the (l-1)th layer to the ith neuron in the lth layer. The spiking-availability function \(sav_i^{(l)}\) is employed to allow each neuron to spike once at most, such that \(sav_i^{(l)}=1\) if the neuron has not spiked before, and \(sav_i^{(l)}=0\) otherwise. The kernel \(\epsilon \) is expressed as follows [13].
$$\begin{aligned} \epsilon =\frac{\tau _{m}}{\tau _{m}-\tau _{s}}\left( e^{-t\mathbin {/}\tau _{m}} - e^{-t\mathbin {/}\tau _{s}}\right) \Theta \left[ t\right] , \end{aligned}$$
(4)
where \(\Theta \) denotes the Heaviside step function. The potential and synaptic current time constants are denoted by \(\tau _{m}\) and \(\tau _s\), respectively. A spike from the jth neuron in the (l-1)th layer at \(\hat{t}_j^{(l-1)}\) is denoted by \(s_{j}^{(l-1)}\). Because the kernel in Eq. (4) consists of two independent sub-kernels,
$$\begin{aligned} \epsilon _{\left( \cdot \right) } = \dfrac{\tau _m}{\tau _m-\tau _s}e^{-t/\tau _{\left( \cdot \right) }}\Theta \left[ t\right] \text {, where } (\cdot )\in \left\{ m, s\right\} , \end{aligned}$$
(5)
Eq. (3) can be expressed as
$$\begin{aligned} u_i^{(l)}\left[ t\right]&= \left( u_{i,m}^{(l)}\left[ t\right] -u_{i,s}^{(l)}\left[ t\right] \right) sav_i^{(l)}\left[ t\right] \text {,}\\\nonumber u_{i,(\cdot )}^{(l)}\left[ t\right]&= \sum _{j}\dfrac{\tau _{m}w_{ij}^{(l)}}{\tau _{m}-\tau _{s}} e^{-\left( t-\hat{t}_{j}^{(l-1)}\right) \mathbin {/}\tau _{\left( \cdot \right) }} \Theta \left[ t-\hat{t}_{j}^{(l-1)}\right] \text {,} \\&\quad \text { where } (\cdot )\in \left\{ m, s\right\} .\nonumber \end{aligned}$$
Here, we introduce a new variable \(v_j^{(l)}\) given by
$$\begin{aligned} v_j^{(l)}\left[ t\right]&= v_{j,m}^{(l)}\left[ t\right] -v_{j,s}^{(l)}\left[ t\right] \text {,}\\\nonumber v_{j,(\cdot )}^{(l)}\left[ t\right]&= \dfrac{\tau _{m}}{\tau _{m}-\tau _{s}} e^{-\left( t-\hat{t}_{j}^{(l-1)}\right) \mathbin {/}\tau _{\left( \cdot \right) }} \Theta \left[ t-\hat{t}_{j}^{(l-1)}\right] \text {,}&\\&\quad \text { where } (\cdot )\in \left\{ m, s\right\} .\nonumber&\end{aligned}$$
The variables \(u_{i,m}^{(l)}\) and \(u_{i,s}^{(l)}\) are reset to zero when the neuron fires a spike. The advantage of this method is that the membrane potential can be evaluated by simply convolving input spikes with two independent kernels, which otherwise would require solving two sequential differential equations [20]. After spiking, the spiking-availability function \(sav_i^{(l)}\) remains constant at zero, preventing additional spike generation.
All variables are recursively evaluated using the explicit finite difference method.
$$\begin{aligned} v_{j,(\cdot )}^{(l)}\left[ t+1\right]&= v_{j,(\cdot )}^{(l)}\left[ t\right] e^{-1/\tau _{(\cdot )}} + \dfrac{\tau _m}{\tau _m-\tau _s}s_j^{(l-1)}\left[ t+1\right] \text {,}\nonumber \\&\quad \text { where } (\cdot )\in \left\{ m, s\right\} ,\nonumber \\ u_{i,(\cdot )}^{(l)}\left[ t+1\right]&= \sum _{j} w_{ij}^{(l)}v_{j,(\cdot )}^{(l)}\left[ t+1\right] \text {, where } (\cdot )\in \left\{ m, s\right\} ,\nonumber \\ u_i^{(l)}\left[ t+1\right]&= \left( u_{i,m}^{(l)}\left[ t+1\right] - u_{i,s}^{(l)}\left[ t+1\right] \right) sav_i^{(l)}\left[ t+1\right] . \end{aligned}$$
(6)
Equation (6) can be mapped onto a computational graph as shown in Fig. 2. Each layer’s processed data are transmitted along the forward pass as spikes (\(\varvec{s}^{(l)}\)).
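The following is a minimal sketch of the per-layer forward recursion of Eq. (6) combined with the dual-threshold rule, assuming dense NumPy weights and placeholder time constants; the variable names mirror the symbols in the text, but the implementation details are our own rather than the authors' code.

```python
import numpy as np

def noso_layer_forward(w, s_in, tau_m=20.0, tau_s=5.0, theta=0.15):
    """Run one NOSO layer over T timesteps following Eq. (6).

    w    : (N_out, N_in) synaptic weights
    s_in : (T, N_in) binary input spike trains (each input spikes at most once)
    Returns (T, N_out) output spike trains and the output spike timings.
    """
    T, n_in = s_in.shape
    n_out = w.shape[0]
    c_tau = tau_m / (tau_m - tau_s)
    v_m = np.zeros(n_in)                      # presynaptic trace v_{j,m}
    v_s = np.zeros(n_in)                      # presynaptic trace v_{j,s}
    sav = np.ones(n_out)                      # spiking availability
    s_out = np.zeros((T, n_out))
    t_hat = np.full(n_out, np.inf)            # output spike timings
    for t in range(T):
        v_m = v_m * np.exp(-1.0 / tau_m) + c_tau * s_in[t]
        v_s = v_s * np.exp(-1.0 / tau_s) + c_tau * s_in[t]
        u = (w @ v_m - w @ v_s) * sav         # u = (u_m - u_s) * sav
        fired = (sav == 1) & ((u >= theta) | (u <= -theta))
        s_out[t, fired] = 1
        t_hat[fired] = t
        sav[fired] = 0                        # NOSO: spike once at most
    return s_out, t_hat
```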

Backward pass and gradients

SNNs are typically trained using forward and backward passes that share the same path in opposite directions, so that surrogate gradients are unavoidable owing to the non-differentiability of spikes [29, 34, 44]. Instead, BPLC + NOSO uses a backward pass via spike timings \(\hat{\varvec{t}}^{(\cdot )}\) rather than the spikes themselves \(\varvec{s}^{(\cdot )}\) (Fig. 2). This backward pass involves differentiable functions only. The output of NOSONet (with M output NOSOs) is the set of spiking latency values of the output NOSOs, \(\varvec{T}_\text {lat}^{(L)}=\{T^{(L)}_{\text {lat}, i}\}_{i=1}^{M}\), as given in Eq. (1). The prediction is then made by reference to the output neuron of the minimum spiking latency. We use a cross-entropy loss function \(\mathcal {L}(-\varvec{T}_\text {lat}^{(L)}, \varvec{\hat{y}})\), where \(\varvec{\hat{y}}\) denotes a one-hot encoded label vector. The loss is evaluated at the end of the learning phase, and the weights are then updated using the gradients assessed when the neurons spiked.
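A minimal sketch of this latency-based readout and loss, assuming PyTorch tensors; the negated latencies act as logits so that the output neuron of minimum latency receives the largest score, while the tensor shapes and example values are ours.

```python
import torch
import torch.nn.functional as F

# T_lat: (batch, M) spiking latencies of the M output NOSOs
# labels: (batch,) integer class labels
T_lat = torch.tensor([[12.0, 4.0, 9.0], [7.0, 15.0, 3.0]])
labels = torch.tensor([1, 2])

# cross-entropy on the negated latencies: the shortest latency wins,
# so the prediction is the output neuron of minimum spiking latency
loss = F.cross_entropy(-T_lat, labels)
pred = torch.argmin(T_lat, dim=1)
print(loss.item(), pred)
```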
We calculate the weight’s update \(\Delta w_{ij}^{(l)}\) using the gradient descent method as follows.
$$\begin{aligned} \Delta w_{ij}^{(l)}= & {} -\eta \dfrac{\partial \mathcal {L}}{\partial \hat{t}_i^{(l)}}\dfrac{{\partial \hat{t}_i^{(l)}}}{\partial u_i^{(l)}}\dfrac{\partial u_i^{(l)}}{\partial w_{ij}^{(l)}}\left[ \hat{t}_i^{(l)}\right] \nonumber \\ {}= & {} -\eta \dfrac{\partial \mathcal {L}}{\partial \hat{t}_i^{(l)}}\dfrac{{\partial \hat{t}_i^{(l)}}}{\partial u_i^{(l)}}v_j^{(l)}\left[ \hat{t}_i^{(l)}\right] \text {.} \end{aligned}$$
(7)
The learning rate and loss function are denoted by \(\eta \) and \(\mathcal {L}\), respectively. Equation (7) is equivalent to
$$\begin{aligned} \Delta \varvec{w}^{(l)} = -\eta diag\left( \varvec{e}^{(l)}\right) \varvec{v}^{(l)}\left[ \varvec{\hat{t}}^{(l)}\right] , \end{aligned}$$
(8)
with the error \(\varvec{e}^{(l)}\) given by
$$\begin{aligned} \varvec{e}^{(l)}&= \nabla _{\varvec{\hat{t}}^{(l)}}\mathcal {L}\odot \hat{\varvec{t}}^{(l)'},\nonumber \\ \nabla _{\varvec{\hat{t}}^{(l)}}\mathcal {L}&= \left[ \dfrac{\partial \mathcal {L}}{\partial \hat{t}_1^{(l)}},\cdots , \dfrac{\partial \mathcal {L}}{\partial \hat{t}_N^{(l)}}\right] ^\textrm{T},\nonumber \\ \hat{\varvec{t}}^{(l)'}&= \left[ \dfrac{\partial \hat{t}_1^{(l)}}{\partial u_1^{(l)}}, \cdots , \dfrac{\partial \hat{t}_N^{(l)}}{\partial u_N^{(l)}}\right] ^\textrm{T}, \end{aligned}$$
(9)
for N neurons in the lth layer. The symbol \(\odot \) denotes the Hadamard product. The matrix \(\varvec{v}^{(l)}[\varvec{\hat{t}}^{(l)}]\) is given by
$$\begin{aligned}\nonumber \varvec{v}^{(l)}\left[ \varvec{\hat{t}}^{(l)}\right] = \begin{bmatrix} v^{(l)}_1\left[ \hat{t}^{(l)}_1\right] &{} \ldots &{} v^{(l)}_M\left[ \hat{t}^{(l)}_1\right] \\ \vdots &{} \ddots &{} \vdots \\ v^{(l)}_1\left[ \hat{t}^{(l)}_N\right] &{} \ldots &{} v^{(l)}_M\left[ \hat{t}^{(l)}_N\right] \end{bmatrix}, \end{aligned}$$
for M neurons in the \(\left( l-1\right) \)th layer.
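A minimal sketch of the batched weight update of Eq. (8), assuming NumPy arrays for the error vector \(\varvec{e}^{(l)}\) and the matrix \(\varvec{v}^{(l)}[\varvec{\hat{t}}^{(l)}]\); it illustrates only the linear algebra, not the full training loop, and the numbers are arbitrary.

```python
import numpy as np

eta = 0.05
e = np.array([0.2, -0.1, 0.4])                     # error e^{(l)} for N = 3 neurons, Eq. (9)
v_at_that = np.random.default_rng(0).random((3, 4))  # v^{(l)}[t_hat^{(l)}] for M = 4 presynaptic neurons

# Eq. (8): row i of v_at_that is scaled by the error of postsynaptic neuron i
delta_w = -eta * np.diag(e) @ v_at_that            # shape (N, M), matches w^{(l)}
```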
Table 2
Classification accuracy and the number of spikes used for inference

Method | Network | Coding | Best accuracy | Average accuracy | #spikes \(N_{\text {sp}}\)

Fashion-MNIST
Ikegawa et al. [16] | 16C3-\(\{\)32C3\(\}\)*6-\(\{\)64C3\(\}\)*5 | Rate | 89.10 | – | 7156K
Zhang et al. [46] | 16C5-P2-32C5-P2-800-128 | Temporal | 90.10 | – | –
Zhang et al. [47] | 400-R400 | Rate | 90.13 | 90.00±0.14 | –
Sun et al. [36] | 32C3-P2-32C3-P2-128 | Rate | 91.56 | – | 12K (only Conv)
Cheng et al. [6] | 32C3-P2-32C3-P2-128 | Rate | 92.07 | – | –
Mirsadeghi et al. [25] | 20C5-P2-40C5-P2-1000 | Temporal | 92.80 | – | –
Zhang and Li [48] | 32C5-P2-64C5-P2-1024 | Temporal | 92.83 | 92.69±0.09 | –
Zhao et al. [49] | 32C5-P2-64C5-P2-1024 | Rate | 93.45 | 93.04±0.31 | –
BPLC + NOSO | 32C5-P2-64C5-P2-600 | Latency | 92.47 | 92.44±0.02 | 14K±0.26K

CIFAR-10
Wu et al. [39] | CNN1\(^{*}\) | Rate | 85.24 | – | –
Wu et al. [39] | CNN2\(^{**}\) | Rate | 90.53 | – | –
Wu et al. [39] | CNN2-half-ch | Rate | 87.80 | – | 1298K
Zhang and Li [48] | CNN1 | Temporal | 89.22 | – | –
Zhang and Li [48] | CNN2 | Temporal | 91.41 | – | 308K
Tan et al. [37] | CNN1 | Modified rate | 89.57 | – | 412K
Tan et al. [37] | CNN2 | Modified rate | 90.13 | – | 342K
Zhao et al. [49] | CNN3\(^{***}\) | Rate | 90.93 | – | –
Lee et al. [23] | ResNet11 | Rate | 90.95 | – | 1530K
BPLC + NOSO | CNN4\(^{****}\) | Latency | 89.77 | 89.37±0.25 | 142K±1.86K

\(^*\)96C3-256C3-P2-384C3-P2-384C3-256C3-1024-1024
\(^{**}\)128C3-256C3-P2-512C3-P2-1024C3-512C3-1024-512
\(^{***}\)128C3-P2-256C3-P2-512C3-P2-1024
\(^{****}\)64C5-128C5-P2-256C5-P2-512C5-256C5-1024-512
The backward propagation of the error from the lth layer to the \(\left( l-1\right) \)th layer (with M neurons) is given by
$$\begin{aligned} \varvec{e}^{(l-1)}&= \left( \varvec{w}^{(l)\mathrm T}\odot \varvec{v}^{(l)'}\left[ \varvec{\hat{t}}^{(l)}\right] \right) \varvec{e}^{(l)}\odot \hat{\varvec{t}}^{(l-1)'},\nonumber \\ \varvec{v}^{(l)'}\left[ \varvec{\hat{t}}^{(l)}\right]&= \begin{bmatrix} \dfrac{\partial v^{(l)}_1}{\partial \hat{t}^{(l-1)}_1}\left[ \hat{t}^{(l)}_1\right] &{} \ldots &{} \dfrac{\partial v^{(l)}_1}{\partial \hat{t}^{(l-1)}_1}\left[ \hat{t}^{(l)}_N\right] \\ \vdots &{} \ddots &{} \vdots \\ \dfrac{\partial v^{(l)}_M}{\partial \hat{t}^{(l-1)}_M}\left[ \hat{t}^{(l)}_1\right] &{} \ldots &{} \dfrac{\partial v^{(l)}_M}{\partial \hat{t}^{(l-1)}_M}\left[ \hat{t}^{(l)}_N\right] \end{bmatrix}. \end{aligned}$$
(10)
Equation (10) is derived in Appendix A. Because each NOSO spikes at most once, elements once written in \(\varvec{v}^{(l)}[\varvec{\hat{t}}^{(l)}]\) and \(\varvec{v}^{(l)'}[\varvec{\hat{t}}^{(l)}]\) are never overwritten. Equation (10) shows that BPLC involves the gradients of spike timings rather than of the spikes themselves. Therefore, the backward pass differs from the forward pass.
Two types of gradients are thus required for BPLC + NOSO: (i) \(\partial \hat{t}_i^{(l)}/\partial u_i^{(l)}\) and (ii) \(\partial v_j^{(l)}/\partial \hat{t}_j^{(l-1)}\) at the spike timing \(\hat{t}_i^{(l)}\). Fortunately, the SRM allows these gradients to be expressed analytically.
Theorem 1
When an SRM neuron (whose membrane potential is \(u_i^{(l)}\)) spikes at a given time \(t (=\hat{t}_i^{(l)})\), the gradient of spike timing \(\hat{t}_i^{(l)}\) with membrane potential is given by
$$\begin{aligned} \dfrac{\partial \hat{t}_i^{(l)}}{\partial u_{i}^{(l)}} = \left( u_{i,m}^{(l)}\left[ \hat{t}_i\right] /\tau _m - u_{i,s}^{(l)}\left[ \hat{t}_i\right] /\tau _s\right) ^{-1}. \end{aligned}$$
(11)
The proof of Theorem 1 is given in Appendix  B. If the neuron does not spike during a learning phase, the gradient in Eq. (11) is zero.
Theorem 2
When an SRM neuron receives an input spike at \(\hat{t}_j^{(l-1)}\), the gradients of \(v_{j,m}^{(l)}\) and \(v_{j,s}^{(l)}\) with respect to \(\hat{t}_j^{(l-1)}\) are given by
$$\begin{aligned} \dfrac{\partial v_{j,\left( \cdot \right) }^{(l)}}{\partial \hat{t}_j^{(l-1)}}\left[ t\right]&= \dfrac{\tau _m}{\tau _{\left( \cdot \right) }\left( \tau _m - \tau _s\right) }e^{-\left( t-\hat{t}_{j}^{(l-1)}\right) \mathbin {/}\tau _{\left( \cdot \right) }}\Theta \left[ t-\hat{t}_{j}^{(l-1)}\right] \nonumber \\&= \dfrac{v_{j,\left( \cdot \right) }^{(l)}\left[ t\right] }{\tau _{\left( \cdot \right) }} \text {, where } (\cdot )\in \left\{ m, s\right\} . \end{aligned}$$
(12)
The proof of Theorem 2 is also given in Appendix B. Using Theorem 2, the gradient \(\partial v_j^{(l)}/\partial \hat{t}_j^{(l-1)}\) is given by
$$\begin{aligned} \dfrac{\partial v_{j}^{(l)}}{\partial \hat{t}_j^{(l-1)}}\left[ \hat{t}_i^{(l)}\right] = v_{j,m}^{(l)}\left[ \hat{t}_i^{(l)}\right] /\tau _m - v_{j,s}^{(l)}\left[ \hat{t}_i^{(l)}\right] /\tau _s. \end{aligned}$$
(13)
Likewise, this gradient is also zero if this neuron does not spike. Both gradients in Eqs. (11) and (13) can simply be calculated by reading out the four local variables (\(u_{i,m}^{(l)}\), \(u_{i,s}^{(l)}\), \(v_{j,m}^{(l)}\), \(v_{j,s}^{(l)}\)) when the neuron spikes.
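A minimal sketch of Eqs. (11) and (13), assuming the four local variables have already been read out at the spike timing; the time constants are placeholders and the function name is ours.

```python
def spike_timing_gradients(u_m, u_s, v_m, v_s, tau_m=20.0, tau_s=5.0):
    """Evaluate the two BPLC gradients at the spike timing t_hat_i.

    u_m, u_s : components of the postsynaptic potential u_i at t_hat_i
    v_m, v_s : components of the presynaptic trace v_j at t_hat_i
    Returns (d t_hat_i / d u_i, d v_j / d t_hat_j), cf. Eqs. (11) and (13).
    """
    dt_du = 1.0 / (u_m / tau_m - u_s / tau_s)   # Eq. (11)
    dv_dt = v_m / tau_m - v_s / tau_s           # Eq. (13)
    return dt_du, dv_dt
```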
The above derivations are for the folded NOSONet, where all tensors for each layer are simply overwritten over time so that the space complexity is independent of the number of timesteps. We used the unfolded NOSONet in the temporal domain to apply the automatic differentiation framework [31]. The equivalence between the folded and unfolded NOSONets is proven in Appendix C.

Experiments

Convolutional NOSONet (C-NOSONet) was trained on Fashion-MNIST [40] and CIFAR-10 [22] using BPLC + NOSO. We used the hyperparameters listed in Appendix E unless otherwise stated; the hyperparameters were searched manually. All experiments were conducted in the PyTorch framework [31] on a GPU workstation (CPU: Intel Xeon Processor Gold, GPU: RTX A6000). NOSONet on Fashion-MNIST was trained using one GPU, whereas NOSONet on CIFAR-10 was trained using four GPUs.

Classification accuracy and the number of spikes for inference

We evaluated the classification accuracy on Fashion-MNIST and CIFAR-10 and the total number of spikes used for inference \(N_{\text {sp}} (=\sum _{i,t}n_{\text {sp}}^{(i,t)})\), where \(n_{\text {sp}}^{(i,t)}\) denotes the number of spikes generated from the layer i at timestep t.
Fashion-MNIST: Fashion-MNIST consists of 70,000 gray-scale images (each \(28\times 28\) in size) of clothing categorized into 10 classes [40]. We rescaled each gray-scale pixel value of an image to the range \(0-0.3\) and applied additive white Gaussian noise (zero mean and 0.05 standard deviation). These values were then used as input currents into the input LIF neurons. We trained a C-NOSONet (32C5-MP2-64C5-MP2-600, where MP denotes MinPool). The classification accuracy of the C-NOSONet is shown in Table 2 in comparison with previous works. We also evaluated the total number of spikes \(N_{\text {sp}}\) over all hidden and output NOSOs in the network for each test sample (Table 2). The results highlight the high sparsity of active NOSOs, which likely reduces the inference latency when implemented in neuromorphic hardware; this is discussed in Section “Discussion”. Figure 3a shows the ratio of active NOSOs to all NOSOs, \(\overline{n}_{\text {sp}}^{(i)} (=\sum _tn_{\text {sp}}^{(i,t)}/C^{(i)}H^{(i)}W^{(i)})\), for layer i over the entire timesteps.
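A minimal sketch of the input encoding described above for Fashion-MNIST (rescaling to the range 0–0.3 and adding zero-mean Gaussian noise with 0.05 standard deviation), assuming a uint8 gray-scale image in NumPy; the function name is ours.

```python
import numpy as np

def encode_input_current(image_uint8, scale=0.3, noise_std=0.05, rng=None):
    """Rescale a gray-scale image to [0, scale] and add Gaussian noise.

    The resulting values serve as input currents into the input LIF neurons,
    as described in the Fashion-MNIST setup.
    """
    rng = rng or np.random.default_rng()
    x = image_uint8.astype(np.float32) / 255.0 * scale
    return x + rng.normal(0.0, noise_std, size=x.shape).astype(np.float32)
```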
CIFAR-10: CIFAR-10 consists of 60,000 real-world color images (each \(3\times 32\times 32\) in size) of objects labeled as 10 classes [22]. All training images were pre-processed such that each image with zero-padding of size 4 was randomly cropped to \(32\times 32\), followed by random horizontal flipping. The RGB values of each pixel were rescaled to the range \(0-0.3\) and then used as input currents. For learning stability, we linearly increased the initial learning rate (1E-2) to the plateau learning rate (5E-2) over the first five epochs (ramp rate: 8E-3/epoch). The fully trained C-NOSONet (64C5-128C5-MP2-256C5-MP2-512C5-256C5-1024-512) yields the classification accuracy and the number of spikes for inference in Table 2. Notably, our classification accuracy exceeds the result from an SNN of the same depth and width (CNN2-half-ch) [39] by approximately 2.0%. Additionally, our NOSONet uses far fewer spikes (only 10.9% of CNN2-half-ch), supporting high-throughput inference. The layer-wise active NOSO ratio \(\overline{n}_{\text {sp}}^{(i)}\) over the entire timesteps is plotted in Fig. 3b, highlighting the high sparsity of spikes.

Minimum-latency pooling versus MaxPool

MinPool supports the latency code by passing only the event of the minimum spiking latency in a given 2D patch. To identify its effects on learning, we compared NOSONets with MinPool layers and with conventional MaxPool layers. Figures 4 and 5 show the comparisons on Fashion-MNIST and CIFAR-10, respectively. Compared with MaxPool, MinPool yields (i) higher classification accuracy, as shown in Figs. 4a and 5a, and (ii) higher spike sparsity, as shown in Figs. 4c and 5c. The accuracy increase despite the reduced spike count may imply that MinPool removes spikes unimportant for classification, unlike dropout, which removes spikes randomly.

Effect of symmetric dual threshold on potential distribution

We identified the effect of the dual threshold on the potential distribution over samples in a given batch by training NOSONet (32C5-MP2-64C5-MP2-600) on Fashion-MNIST and CIFAR-10 with four different threshold conditions: single thresholds of 0.05 and 0.1, and dual thresholds of ±0.1 and ±0.15. The results are shown in Fig. 6. The use of the dual threshold greatly lowers the standard deviation and yields a mean that is almost zero because it confines the potential to the range between \(-\vartheta \) and \(\vartheta \). Additionally, the highest accuracy was attained with the dual threshold ±0.15. The potential distributions for a single-threshold case (0.1) and a dual-threshold case (±0.15) on Fashion-MNIST are detailed in Appendix F.

Discussion

We estimate the inference time for an SNN mapped onto a general digital multicore neuromorphic processor using the following assumptions.
Assumption 1: Total \(N_\text {n}\) neurons in a given SNN are distributed uniformly over \(N_\text {c}\) cores of a neuromorphic processor, i.e., \(N_{\text {n}}/N_{\text {c}}\) neurons per core.
Assumption 2: All \(N_{\text {n}}/N_{\text {c}}\) neurons in each core share a multiplier by time-division multiplexing, so that the current potential is multiplied by a potential decay factor (\(e^{-1/\tau _\text {m}}\)) for one neuron at each cycle.
Assumption 3: Synaptic operations are also executed serially.
Assumption 4: Neurons in different cores are updated in parallel.
Each timestep for an SNN with LIF neurons includes two primary processes: (i) the process of multiplying the current potential by a decay factor and (ii) synaptic operation (spike routing to the destination neurons plus the consequent potential update). Process (i) in a digital neuromorphic processor is commonly pipelined within a core but executed in parallel over the \(N_{\text {c}}\) cores [20]. Thus, at each timestep, the time for process (i) for all \(N_{\text {n}}\) neurons (\(T_{\text {up}}\)) is given by
$$\begin{aligned} T_{\text {up}} = \left( N_{\text {n}}/N_{\text {c}} + a\right) f_{\text {clk}}^{-1}\text {,} \end{aligned}$$
where a and \(f_{\text {clk}}\) denote the initialization cycle number and clock speed, respectively. Although the number of initialization cycles a differs for different processor designs, it is commonly a few clock cycles. Given the total number of spikes generated at timestep t (\(n_{\text {sp}}[t]\)), the time for synaptic operations at each timestep is given by
$$\begin{aligned} T_{\text {sop}} = n_{\text {sp}}[t]\left( \text {SynOPS}\right) ^{-1}\text {.} \end{aligned}$$
Given Assumptions 1–4, the total time for processes (i) and (ii) at each timestep is given by \(T_{\text {step}}=T_{\text {up}}+T_{\text {sop}}\). Therefore, the total inference time over \(N_{\text {step}}\) timesteps, \(T_{\text {inf}}=\sum _tT_{\text {step}}[t]\), is as follows.
$$\begin{aligned} T_{\text {inf}} = N_{\text {step}}\left( N_{\text {n}}/N_{\text {c}} + a\right) f_{\text {clk}}^{-1} + N_{\text {sp}}\left( \text {SynOPS}\right) ^{-1}\text {,} \end{aligned}$$
(14)
where \(N_{\text {sp}}=\sum _tn_{\text {sp}}[t]\). The number of neurons in a core \((N_{\text {n}}/N_{\text {c}})\) differs for different designs. We assume 1k neurons in each core [8], a few tens of MSynOPS as in [3, 12, 27], and a 100 MHz clock speed. For inference involving \(N_{\text {sp}}\) spikes (\(\sim 10^6\) as in Table 2) and an \(N_{\text {step}}\) of \(\sim 100\), Eq. (14) identifies that \(T_{\text {sop}}\) dominates \(T_{\text {up}}\), so that \(T_{\text {inf}}\) is dictated by \(T_{\text {sop}}\). Therefore, \(N_{\text {sp}}\) should be a primary concern when developing learning algorithms.
For SNNs with IF neurons (without leakage), process (i) is unnecessary so that \(T_{\text {up}}\) vanishes. Therefore, \(T_{\text {inf}}\) is solely determined by \(N_{\text {sp}}\).
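A minimal sketch of the estimate in Eq. (14), assuming the operating point quoted above (1k neurons per core, tens of MSynOPS, a 100 MHz clock) and a placeholder of four initialization cycles; the numbers are illustrative, not measurements.

```python
def inference_time(n_step, n_neuron, n_core, n_spike,
                   syn_ops_per_s=30e6, f_clk=100e6, a=4):
    """Estimate T_inf from Eq. (14).

    t_up  : per-timestep potential-decay updates, pipelined within each core
    t_sop : serial synaptic operations for all spikes routed during inference
    """
    t_up = n_step * (n_neuron / n_core + a) / f_clk
    t_sop = n_spike / syn_ops_per_s
    return t_up + t_sop

# example: 100 timesteps, 1M neurons on 1k cores (1k neurons/core), 1e6 spikes;
# t_sop (~33 ms) dominates t_up (~1 ms), so T_inf is dictated by N_sp
print(inference_time(n_step=100, n_neuron=1_000_000, n_core=1_000, n_spike=1_000_000))
```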

Conclusion and outlook

We proposed a mathematically rigorous learning algorithm (BPLC) based on a spiking latency code in conjunction with minimum-latency pooling (MinPool) operations. We overcome the dead neuron issue using a symmetric dual threshold for spiking, which additionally improves the potential distribution over samples in a given batch (and thus the classification accuracy). The BPLC-trained NOSONet on CIFAR-10 achieves high accuracy, outperforming an SNN of the same depth and width by approximately 2\(\%\) while using far fewer spikes (only 10.9%). This large reduction in the number of spikes greatly reduces the inference latency of SNNs implemented in digital neuromorphic processors.
Currently, we conceive the following future work to boost the impact of BPLC + NOSO.
  • Scalability confirmation: Although the viability of BPLC + NOSO was identified, its applicability to deeper SNNs on more complex datasets should be confirmed. Such datasets include not only static image datasets like ImageNet [33] but also event datasets like CIFAR10-DVS [24] and DVS128 Gesture [1]. Given that the number of spikes is severely capped, BPLC + NOSO on event datasets in particular might be challenging.
  • Hyperparameter fine-tuning: To further increase the classification accuracy, the hyperparameters should be fine-tuned using optimization techniques.
  • Weight quantization: BPLC + NOSO is based on full-precision (32b FP) weights. However, the viability of BPLC + NOSO with reduced precision weights should be confirmed to improve the efficiency in memory use. This may need an additional weight-quantization algorithm in conjunction with BPLC + NOSO like CBP [18].
  • Search for new application domains: We need to search for new application domains in which BPLC + NOSO can leverage its low processing latency and power when implemented in neuromorphic hardware. Examples potentially include intelligent control systems such as constrained nonlinear systems [41-43].

Declarations

Conflict of interest

The authors have no relevant financial or non-financial interests to disclose.

Code availability

The code is available in the GitHub repository, https://github.com/dooseokjeong/BPLC-NOSO.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices

Appendix A Derivation of backward propagation of errors

We define
$$\begin{aligned}&C_{\tau } = \frac{\tau _{m}}{\tau _{m}-\tau _{s}}\text {,}\nonumber \\&S_{j}\left[ t\right] = \Theta \left[ t-\hat{t}_{j}^{(l-1)}\right] \text {,}\nonumber \\&E_{j,(\cdot )}\left[ t\right] = e^{-\left( t-\hat{t}_{j}^{(l-1)}\right) \mathbin {/}\tau _{(\cdot )}} \text {, where } (\cdot )\in \left\{ m, s\right\} \text {.} \end{aligned}$$
(A1)
The subthreshold membrane potential of NOSO is
$$\begin{aligned} u_{i}^{(l)}\left[ t\right] = \sum _{j} C_{\tau }w_{ij}^{(l)}\left( E_{j, m}[t]-E_{j, s}[t] \right) S_j\left[ t\right] sav_i^{(l)}\left[ t\right] \text {.} \end{aligned}$$
(A2)
Thus, the following equation holds when spiking with a spiking threshold \(\vartheta \).
$$\begin{aligned} \vartheta ={} & {} \sum _{j} C_{\tau }w_{ij}^{(l)} \left( E_{j, m}\left[ \hat{t}_{i}^{(l)}\right] - E_{j, s}\left[ \hat{t}_{i}^{(l)}\right] \right) \nonumber \\{} & {} \times S_{j}\left[ \hat{t}_{i}^{(l)}\right] sav_i^{(l)}\left[ \hat{t}_{i}^{(l)}\right] \text {.} \end{aligned}$$
(A3)
For simplicity, we omit the spiking-availability function \(sav_i^{(l)}\) hereafter. The derivative \(\partial \hat{t}_{i}^{(l)}/\partial \hat{t}_{j}^{(l-1)}\) is acquired by differentiating Eq. (A3) with respect to \(\hat{t}_{j}^{(l-1)}\).
$$\begin{aligned}{} & {} \dfrac{\partial \hat{t}_{i}^{(l)}}{\partial \hat{t}_{j}^{(l-1)}}\nonumber \\{} & {} \quad =\dfrac{C_{\tau }w_{ij}^{(l)}\left( \tau _m^{-1}E_{j, m}\left[ \hat{t}_{i}^{(l)}\right] - \tau _s^{-1}E_{j, s}\left[ \hat{t}_{i}^{(l)}\right] \right) S_{j}\left[ \hat{t}_i^{(l)}\right] }{\sum _k C_{\tau }w_{ik}^{(l)}\left( \tau _m^{-1}E_{k,m}\left[ \hat{t}_i^{(l)}\right] - \tau _s^{-1}E_{k,s}\left[ \hat{t}_i^{(l)}\right] \right) S_{k}\left[ \hat{t}_i^{(l)}\right] }\text {.}\nonumber \\ \end{aligned}$$
(A4)
According to Theorem 1, the denominator of the right-hand side of Eq. (A4) equals \(\left( \partial \hat{t}_i^{(l)}/\partial u_i^{(l)}\right) ^{-1}\), and thus we have
$$\begin{aligned}{} & {} \dfrac{\partial \hat{t}_{i}^{(l)}}{\partial \hat{t}_{j}^{(l-1)}}\nonumber \\{} & {} \quad =C_{\tau }w_{ij}^{(l)}\left( \dfrac{E_{j, m}\left[ \hat{t}_{i}^{(l)}\right] }{\tau _m} - \dfrac{E_{j, s}\left[ \hat{t}_{i}^{(l)}\right] }{\tau _s}\right) S_{j}\left[ \hat{t}_i^{(l)} \right] \dfrac{\partial \hat{t}_i^{(l)}}{\partial u_i^{(l)}}. \nonumber \\ \end{aligned}$$
(A5)
Applying the chain rule to the left-hand side of Eq. (A5) yields the following equation:
$$\begin{aligned} \dfrac{\partial u_i^{(l)}}{\partial \hat{t}_{j}^{(l-1)}} = C_{\tau }w_{ij}^{(l)}\left( \dfrac{E_{j, m}\left[ \hat{t}_{i}^{(l)}\right] }{\tau _m} - \dfrac{E_{j, s}\left[ \hat{t}_{i}^{(l)}\right] }{\tau _s} \right) S_{j}\left[ \hat{t}_i^{(l)}\right] . \end{aligned}$$
(A6)
Given that
$$\begin{aligned} v_{j,(\cdot )}^{(l)}\left[ t\right] = C_{\tau }E_{j,(\cdot )}\left[ t\right] S_{j}\left[ t\right] \text {, where } (\cdot )\in \left\{ m, s\right\} , \end{aligned}$$
(A7)
Eq. (A6) is re-expressed as
$$\begin{aligned} \dfrac{\partial u_i^{(l)}}{\partial \hat{t}_{j}^{(l-1)}} = w_{ij}^{(l)}\left( v_{j,m}^{(l)}\left[ \hat{t}_i^{(l)}\right] /\tau _m - v_{j,s}^{(l)}\left[ \hat{t}_i^{(l)}\right] /\tau _s \right) . \end{aligned}$$
(A8)
According to Theorem 2,
$$\begin{aligned} \dfrac{\partial v_{j,\left( \cdot \right) }^{(l)}}{\partial \hat{t}_j^{(l-1)}} \left[ \hat{t}_i^{(l)}\right] =&\dfrac{C_{\tau }}{\tau _{\left( \cdot \right) }} E_{j, (\cdot )}\left[ \hat{t}_i^{(l)}\right] S_j\left[ \hat{t}_i^{(l)}\right] \text {,} \\\nonumber&\text {where } (\cdot )\in \left\{ m, s\right\} .\nonumber \end{aligned}$$
Using Eq. (A7) at \(t=\hat{t}_i^{(l)}\), the following equation holds: \(\tau _{(\cdot )}^{-1}v_{j,(\cdot )}^{(l)}\left[ \hat{t}_i^{(l)}\right] =\partial v_{j,(\cdot )}^{(l)}/\partial \hat{t}_{j}^{(l-1)}\left[ \hat{t}_i^{(l)}\right] \), where \(\left( \cdot \right) \in \left\{ m, s\right\} \). Therefore, Eq. (A8) is re-arranged as
$$\begin{aligned} \dfrac{\partial u_i^{(l)}}{\partial \hat{t}_{j}^{(l-1)}} = w_{ij}^{(l)}\dfrac{\partial v_j^{(l)}}{\partial \hat{t}_j^{(l-1)}}\left[ \hat{t}_i^{(l)}\right] . \end{aligned}$$
(A9)
The error for the jth neuron in the \((l-1)\)th layer \(e_j^{(l-1)}\) is given by
$$\begin{aligned} e_j^{(l-1)}&= \dfrac{\partial \mathcal {L}}{\partial \hat{t}_{j}^{(l-1)}}\dfrac{\partial \hat{t}_{j}^{(l-1)}}{\partial u_j^{(l-1)}}\nonumber \\&=\sum _i\left( \dfrac{\partial \mathcal {L}}{\partial \hat{t}_{i}^{(l)}}\dfrac{\partial \hat{t}_{i}^{(l)}}{\partial u_i^{(l)}}\right) \dfrac{\partial u_i^{(l)}}{\partial \hat{t}_{j}^{(l-1)}}\dfrac{\partial \hat{t}_{j}^{(l-1)}}{\partial u_j^{(l-1)}}\nonumber \\&=\sum _ie_i^{(l)}\dfrac{\partial u_i^{(l)}}{\partial \hat{t}_{j}^{(l-1)}}\dfrac{\partial \hat{t}_{j}^{(l-1)}}{\partial u_j^{(l-1)}}. \end{aligned}$$
(A10)
Plugging Eq. (A9) into Eq. (A10) therefore leads to
$$\begin{aligned} e_j^{(l-1)} = \sum _ie_i^{(l)}w_{ij}^{(l)}\dfrac{\partial v_j^{(l)}}{\partial \hat{t}_j^{(l-1)}}\left[ \hat{t}_i^{(l)}\right] \dfrac{\partial \hat{t}_{j}^{(l-1)}}{\partial u_j^{(l-1)}}. \end{aligned}$$
(A11)
Equation (A11) is expressed as the following matrix formula.
$$\begin{aligned}\nonumber \varvec{e}^{(l-1)} = \left( \varvec{w}^{(l)\mathrm T}\odot \varvec{v}^{(l)'}\left[ \varvec{\hat{t}}^{(l)}\right] \right) \varvec{e}^{(l)}\odot \hat{\varvec{t}}^{(l-1)'}, \end{aligned}$$
where
$$\begin{aligned} \varvec{v}^{(l)'}\left[ \varvec{\hat{t}}^{(l)}\right] = \begin{bmatrix} \dfrac{\partial v^{(l)}_1}{\partial \hat{t}^{(l-1)}_1}\left[ \hat{t}^{(l)}_1\right] &{} \ldots &{} \dfrac{\partial v^{(l)}_1}{\partial \hat{t}^{(l-1)}_1}\left[ \hat{t}^{(l)}_N\right] \\ \vdots &{} \ddots &{} \vdots \\ \dfrac{\partial v^{(l)}_M}{\partial \hat{t}^{(l-1)}_M}\left[ \hat{t}^{(l)}_1\right] &{} \ldots &{} \dfrac{\partial v^{(l)}_M}{\partial \hat{t}^{(l-1)}_M}\left[ \hat{t}^{(l)}_N\right] \end{bmatrix}.\nonumber \end{aligned}$$

Appendix B Proofs of Theorems

Theorem 1
When an SRM neuron (whose membrane potential is \(u_i^{(l)}\)) spikes at a given time \(t (=\hat{t}_i^{(l)})\), the gradient of spike timing \(\hat{t}_i^{(l)}\) with membrane potential is given by
$$\begin{aligned} \dfrac{\partial \hat{t}_i^{(l)}}{\partial u_{i}^{(l)}} = \left( u_{i,m}^{(l)}\left[ \hat{t}_i\right] /\tau _m - u_{i,s}^{(l)}\left[ \hat{t}_i\right] /\tau _s\right) ^{-1}.\nonumber \end{aligned}$$
Proof
The update of weight \(w_{ij}^{(l)}\) is calculated using the gradient descent method as follows—
$$\begin{aligned} \Delta w_{ij}^{(l)}&= -\eta \dfrac{\partial \mathcal {L}}{\partial w_{ij}^{(l)}}\nonumber \\&= -\eta \dfrac{\partial \mathcal {L}}{\partial \hat{t}_{i}^{(l)}} \dfrac{\partial \hat{t}_{i}^{(l)}}{\partial w_{ij}^{(l)}}\nonumber \\&= -\eta \dfrac{\partial \mathcal {L}}{\partial \hat{t}_{i}^{(l)}} \dfrac{\partial \hat{t}_{i}^{(l)}}{\partial u_i^{(l)}} \dfrac{\partial u_i^{(l)}}{\partial w_{ij}^{(l)}}\left[ \hat{t}_i^{(l)}\right] . \end{aligned}$$
(B12)
Given that \(u_i^{(l)}\left[ t\right] =\sum _j w_{ij}^{(l)}v_j^{(l)}\left[ t\right] \), the gradient \(\partial u_i^{(l)}/\partial w_{ij}^{(l)}\left[ t\right] \) equals \(v_j^{(l)}\left[ t\right] \). Consequently, we have
$$\begin{aligned} \Delta w_{ij}^{(l)} = -\eta \dfrac{\partial \mathcal {L}}{\partial \hat{t}_{i}^{(l)}} \dfrac{\partial \hat{t}_{i}^{(l)}}{\partial u_i^{(l)}} v_j^{(l)}\left[ \hat{t}_i^{(l)}\right] . \end{aligned}$$
(B13)
Differentiating Eq. (A3) with respect to \(w_{ij}^{(l)}\) yields
$$\begin{aligned} \dfrac{\partial \vartheta }{\partial w_{ij}^{(l)}} = v_{j}^{(l)}\left[ \hat{t}_i^{(l)}\right] + \dfrac{\partial u_i^{(l)}}{\partial t}\left[ {\hat{t}_i^{(l)}}\right] \dfrac{\partial \hat{t}_i^{(l)}}{\partial w_{ij}^{(l)}}. \end{aligned}$$
(B14)
The left-hand side of Eq. (B14) is zero because the threshold \(\vartheta \) is constant. Thus, the following equation holds—
$$\begin{aligned} \dfrac{\partial \hat{t}_i^{(l)}}{\partial w_{ij}^{(l)}} = -\left( \dfrac{\partial u_i^{(l)}}{\partial t}\left[ {\hat{t}_i^{(l)}}\right] \right) ^{-1}v_j^{(l)}\left[ \hat{t}_i^{(l)}\right] . \end{aligned}$$
(B15)
Plugging Eq. (B15) into Eq. (B12) yields
$$\begin{aligned} \Delta w_{ij}^{(l)} = \eta \dfrac{\partial \mathcal {L}}{\partial \hat{t}_{i}^{(l)}} \left( \dfrac{\partial u_i^{(l)}}{\partial t}\left[ {\hat{t}_i^{(l)}}\right] \right) ^{-1} v_j^{(l)}\left[ \hat{t}_i^{(l)}\right] . \end{aligned}$$
(B16)
A comparison between Eqs. (B13) and (B16) indicates that the following equation holds.
$$\begin{aligned} \dfrac{\partial \hat{t}_i^{(l)}}{\partial u_{i}^{(l)}} = -\left( \dfrac{\partial u_i^{(l)}}{\partial t}\left[ {\hat{t}_i^{(l)}}\right] \right) ^{-1}. \end{aligned}$$
(B17)
The right-hand side of Eq. (B17) is obtained by differentiating Eq. (A2) with respect to t and evaluating the derivative at the spike timing \(\hat{t}_i^{(l)}\), which finally leads to
$$\begin{aligned} \dfrac{\partial \hat{t}_i^{(l)}}{\partial u_{i}^{(l)}} = \left( u_{i,m}^{(l)}\left[ \hat{t}_i^{(l)}\right] /\tau _m - u_{i,s}^{(l)}\left[ \hat{t}_i^{(l)}\right] /\tau _s\right) ^{-1},\nonumber \end{aligned}$$
where
$$\begin{aligned}\nonumber u_{i,\left( \cdot \right) }^{(l)}\left[ \hat{t}^{(l)}_i\right]&= \sum _{j} C_{\tau }w_{ij}^{(l)} E_{j, (\cdot )}\left[ \hat{t}^{(l)}_i\right] S_j\left[ \hat{t}^{(l)}_i\right] \text {,}\\&\quad \text { where } (\cdot )\in \left\{ m, s\right\} . \end{aligned}$$
\(\square \)
Theorem 2
When an SRM neuron receives an input spike at \(\hat{t}_j^{(l-1)}\), the gradients of \(v_{j,m}^{(l)}\) and \(v_{j,s}^{(l)}\) with respect to \(\hat{t}_j^{(l-1)}\) are given by
$$\begin{aligned} \dfrac{\partial v_{j,\left( \cdot \right) }^{(l)}}{\partial \hat{t}_j^{(l-1)}}\left[ t\right] = \dfrac{C_\tau }{\tau _{(\cdot )}} E_{j, (\cdot )}\left[ t\right] S_j\left[ t\right] \text {, where } (\cdot )\in \left\{ m, s\right\} .\nonumber \end{aligned}$$
Proof
The variables \(v_{j,m}^{(l)}\) and \(v_{j,s}^{(l)}\) are given by
$$\begin{aligned} v_{j,\left( \cdot \right) }^{(l)}\left[ t\right] = C_{\tau }E_{j, (\cdot )}\left[ t\right] S_j\left[ t\right] , \text {where } (\cdot )\in \left\{ m, s\right\} . \end{aligned}$$
(B18)
To be precise, the Heaviside step function in Eq. (B18) should be \(\Theta \left[ t-\hat{t}_{j}^{(l-1)}-\varepsilon \right] \) with \(\varepsilon \rightarrow 0^+\) because \(v_{j,\left( \cdot \right) }^{(l)}\) at \(\hat{t}_j^{(l-1)}\) is \(\tau _m/\left( \tau _m-\tau _s\right) \) rather than \(\tau _m/\left[ 2\left( \tau _m-\tau _s\right) \right] \). Given this substitution, differentiating Eq. (B18) with respect to \(\hat{t}_j^{(l-1)}\) yields
$$\begin{aligned}\nonumber \dfrac{\partial v_{j,\left( \cdot \right) }^{(l)}}{\partial \hat{t}_j^{(l-1)}}\left[ t\right] = \dfrac{C_{\tau }}{\tau _{(\cdot )}} E_{j, (\cdot )}\left[ t\right] S_j\left[ t\right] \text {, where } (\cdot )\in \left\{ m, s\right\} . \end{aligned}$$
\(\square \)
Theorem 3
Spike-stamp vectors for NOSOs satisfy the following equation:
$$\begin{aligned} \varvec{s}^{(l)}\left[ t_1\right] \odot \varvec{s}^{(l)}\left[ t_2\right] =\left\{ \begin{array}{rl} \varvec{s}^{(l)}\left[ t_1\right] &{}\quad \text {if}\quad t_1=t_2 \\ \varvec{0} &{}\quad \text {otherwise}. \end{array}\right. \end{aligned}$$
(B19)
Proof
Because NOSOs spike at most once, for all i, \(s_i^{(l)}\left[ t_1\right] s_i^{(l)}\left[ t_2\right] =0\) if \(t_1\ne t_2\), and \(s_i^{(l)}\left[ t_1\right] s_i^{(l)}\left[ t_2\right] =s_i^{(l)}\left[ t_1\right] \) if \(t_1 = t_2\). Therefore, Eq. (B19) holds. \(\square \)
Theorem 4
The weight update for the folded SNN,
$$\begin{aligned} \Delta \varvec{w}^{(l)} = -\eta diag\left( \varvec{e}^{(l)\mathrm T}\right) \varvec{v}^{(l)}\left[ \varvec{\hat{t}}^{(l)}\right] \text {,} \end{aligned}$$
(B20)
is equivalent to the following equation—
$$\begin{aligned} \Delta \varvec{w}^{(l)} = -\eta \sum _{t=1}^{T}\left( \varvec{\overline{e}}^{(l)}\left[ t\right] \odot \varvec{s}^{(l)}\left[ t\right] \right) \varvec{v}^{(l)}\left[ t\right] ^\textrm{T}, \end{aligned}$$
(B21)
where \(\varvec{v}^{(l)}[t]\) is given by \(\varvec{v}^{(l)}[t]=\left[ v_1^{(l)}[t],\cdots , v_m^{(l)}[t]\right] ^\textrm{T}\).
Proof
The error \(\varvec{e}^{(l)}\) is known to be
$$\begin{aligned} \varvec{e}^{(l)} = \sum _{t=1}^{T}\varvec{\overline{e}}^{(l)}\left[ t\right] \odot \varvec{s}^{(l)}\left[ t\right] . \end{aligned}$$
(B22)
Using Eq. (B22) and a basic property of the Hadamard product, the matrix \(diag\left( \varvec{e}^{(l)\mathrm T}\right) \) on the right-hand side of Eq. (B20) is unfolded as
$$\begin{aligned} diag\left( \varvec{e}^{(l)}\right) = \sum _{t=1}^{T}diag\left( \varvec{\overline{e}}^{(l)}\left[ t\right] \right) diag\left( \varvec{s}^{(l)}\left[ t\right] \right) . \end{aligned}$$
(B23)
The matrix \(\varvec{v}^{(l)}\left[ \varvec{\hat{t}}^{(l)}\right] \) in Eq. (B20), given by
$$\begin{aligned} \varvec{v}^{(l)}\left[ \varvec{\hat{t}}^{(l)}\right] = \begin{bmatrix} v^{(l)}_1\left[ \hat{t}^{(l)}_1\right] &{} \ldots &{} v^{(l)}_M\left[ \hat{t}^{(l)}_1\right] \\ \vdots &{} \ddots &{} \vdots \\ v^{(l)}_1\left[ \hat{t}^{(l)}_N\right] &{} \ldots &{} v^{(l)}_M\left[ \hat{t}^{(l)}_N\right] \end{bmatrix},\nonumber \end{aligned}$$
is unfolded as
$$\begin{aligned} \varvec{v}^{(l)}\left[ \varvec{\hat{t}}^{(l)}\right] = \sum _{t'=1}^{T}\varvec{s}^{(l)}\left[ t'\right] \varvec{v}^{(l)}\left[ t'\right] ^\textrm{T}. \end{aligned}$$
(B24)
Entering Eqs. (B23) and (B24) into Eq. (B20) yields
$$\begin{aligned} \Delta \varvec{w}^{(l)}&= -\eta \sum _{t}^{T}\sum _{t'}^{T} \bigg [ diag\left( \varvec{\overline{e}}^{(l)}\left[ t\right] \right) \nonumber \\&\qquad diag\left( \varvec{s}^{(l)}\left[ t\right] \right) \varvec{s}^{(l)}\left[ t'\right] \varvec{v}^{(l)}\left[ t'\right] ^\textrm{T} \bigg ]. \end{aligned}$$
(B25)
Note that \(diag\left( \varvec{s}^{(l)}\left[ t\right] \right) \varvec{s}^{(l)}\left[ t'\right] =\varvec{s}^{(l)}\left[ t\right] \odot \varvec{s}^{(l)}\left[ t'\right] \), which is always zero if \(t\ne t'\) according to Theorem 3. Therefore, we have
$$\begin{aligned}\nonumber \Delta \varvec{w}^{(l)}&= -\eta \sum _{t=1}^{T}diag\left( \varvec{\overline{e}}^{(l)}\left[ t\right] \right) \varvec{s}^{(l)}\left[ t\right] \varvec{v}^{(l)}\left[ t\right] ^\textrm{T}.\\\nonumber&= -\eta \sum _{t=1}^{T}\left( \varvec{\overline{e}}^{(l)}\left[ t\right] \odot \varvec{s}^{(l)}\left[ t\right] \right) \varvec{v}^{(l)}\left[ t\right] ^\textrm{T}.\nonumber \end{aligned}$$
\(\square \)
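The folded/unfolded equivalence of Theorem 4 can also be checked numerically; the following is a minimal sketch, assuming NumPy and randomly generated spike-stamp vectors with at most one spike per neuron. All array names are ours, and the check is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, M = 6, 3, 4                        # timesteps, post- and presynaptic neurons

# each postsynaptic neuron spikes at most once: pick one (or no) spike time per neuron
s = np.zeros((T, N))
for i, t_spk in enumerate(rng.integers(0, T + 1, size=N)):
    if t_spk < T:
        s[t_spk, i] = 1.0

e_bar = rng.normal(size=(T, N))          # per-timestep errors e_bar^{(l)}[t]
v = rng.normal(size=(T, M))              # presynaptic traces v^{(l)}[t]
eta = 0.1

# folded form, Eq. (B20): dw = -eta * diag(e) @ v[t_hat]
e = (e_bar * s).sum(axis=0)                              # Eq. (B22)
v_at_that = sum(np.outer(s[t], v[t]) for t in range(T))  # Eq. (B24)
dw_folded = -eta * np.diag(e) @ v_at_that

# unfolded form, Eq. (B21)
dw_unfolded = -eta * sum(np.outer(e_bar[t] * s[t], v[t]) for t in range(T))

print(np.allclose(dw_folded, dw_unfolded))  # -> True
```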
Theorem 5
The backward propagation of errors
$$\begin{aligned} \varvec{e}^{(l-1)} = \left( \varvec{w}^{(l)\mathrm T}\odot \varvec{v}^{(l)'}\left[ \varvec{\hat{t}}^{(l)}\right] \right) \varvec{e}^{(l)}\odot \hat{\varvec{t}}^{(l-1)'} \end{aligned}$$
(B26)
is unfolded over the timesteps as follows—
$$\begin{aligned}\nonumber \varvec{e}^{(l-1)}&= \sum _{t=1}^{T}\varvec{\tilde{e}}^{(l-1)}[t],\nonumber \\ \varvec{\tilde{e}}^{(l-1)}[t]&= \sum _{t'=t}^{T}\left( \varvec{w}^{(l)\mathrm T}\varvec{\tilde{e}}^{(l)}\left[ t'\right] \right) \odot \varvec{B}^{(l)}\left[ t,t'\right] \nonumber \\&\qquad \odot \varvec{A}^{(l-1)}\left[ t\right] \odot \varvec{s}^{(l-1)}\left[ t\right] ,\nonumber \\ \varvec{B}^{(l)}\left[ t,t'\right]&= C_{\tau } \left[ \tau _m^{-1}e^{-\left( t'-t\right) /\tau _m} - \tau _s^{-1}e^{-\left( t'-t\right) /\tau _s} \right] \varvec{1}.\nonumber \end{aligned}$$
The all-one vector is denoted by \(\varvec{1}=\left[ 1,\cdots , 1\right] ^\textrm{T}\).
Proof
The matrix \(\varvec{v}^{(l)'}\left[ \varvec{\hat{t}}^{(l)}\right] \) in Eq. (B26),
$$\begin{aligned} \varvec{v}^{(l)'}\left[ \varvec{\hat{t}}^{(l)}\right] = \begin{bmatrix} \dfrac{\partial v^{(l)}_1}{\partial \hat{t}^{(l-1)}_1}\left[ \hat{t}^{(l)}_1\right] &{} \ldots &{} \dfrac{\partial v^{(l)}_1}{\partial \hat{t}^{(l-1)}_1}\left[ \hat{t}^{(l)}_N\right] \\ \vdots &{} \ddots &{} \vdots \\ \dfrac{\partial v^{(l)}_M}{\partial \hat{t}^{(l-1)}_M}\left[ \hat{t}^{(l)}_1\right] &{} \ldots &{} \dfrac{\partial v^{(l)}_M}{\partial \hat{t}^{(l-1)}_M}\left[ \hat{t}^{(l)}_N\right] \end{bmatrix},\nonumber \end{aligned}$$
is unfolded as
$$\begin{aligned}{} & {} \varvec{v}^{(l)'}\left[ \varvec{\hat{t}}^{(l)}\right] = \sum _{t=1}^{T}\varvec{C}^{(l)}\left[ t\right] \varvec{s}^{(l)}\left[ t\right] ^\textrm{T}. \end{aligned}$$
(B27)
$$\begin{aligned}{} & {} \varvec{C}^{(l)}\left[ t\right] =\left[ \frac{\partial v^{(l)}_1}{\partial \hat{t}_1^{(l-1)}}\left[ t\right] ,\cdots \frac{\partial v^{(l)}_M}{\partial \hat{t}_M^{(l-1)}}\left[ t\right] \right] ^\textrm{T}. \end{aligned}$$
(B28)
Its elements are given by
$$\begin{aligned} \dfrac{\partial v_j^{(l)}}{\partial \hat{t}_j^{(l-1)}}\left[ t\right] = C_{\tau }\left( \dfrac{E_{j,m}\left[ t\right] }{\tau _m} - \dfrac{E_{j,s}\left[ t\right] }{\tau _s}\right) S_j\left[ t\right] . \end{aligned}$$
(B29)
Note that the element \(\partial v_j^{(l)}/\partial \hat{t}_j^{(l-1)}\left[ t\right] \) is a continuous function of \(t \left( \ge \hat{t}_j^{(l-1)}\right) \), and \(\varvec{v}^{(l)'}\left[ \varvec{\hat{t}}^{(l)}\right] \) is the read-out of \(\varvec{C}^{(l)}\left[ t\right] \) at \(\varvec{\hat{t}}^{(l)}\) using \(\varvec{s}^{(l)}\). Plugging Eqs. (B22) and (B27) into Eq. (B26) yields
$$\begin{aligned} \varvec{e}^{(l-1)} =&\sum _{t'=1}^{T}\sum _{t=1}^{T} \bigg [ \varvec{w}^{(l)\mathrm T}\odot \left( \varvec{A}^{(l)}\left[ t'\right] \varvec{s}^{(l)}\left[ t'\right] ^\textrm{T}\right) \varvec{\overline{e}}^{(l)}\left[ t\right] \nonumber \\&\odot \varvec{s}^{(l)}\left[ t\right] \bigg ] \odot \hat{\varvec{t}}^{(l-1)'}. \end{aligned}$$
(B30)
We use a general property of the Hadamard product,
$$\begin{aligned} \left( \varvec{w}\odot \varvec{a}\varvec{b}^\textrm{T}\right) \varvec{c} = \left[ \varvec{w}\left( \varvec{b}\odot \varvec{c}\right) \right] \odot \varvec{a},\nonumber \end{aligned}$$
where \(\varvec{w}\in \mathbb {R}^{n\times m}\), \(\varvec{a}\in \mathbb {R}^{n}\), \(\varvec{b}\in \mathbb {R}^{m}\), and \(\varvec{c}\in \mathbb {R}^{m}\). Equation (B30) is consequently arranged as
$$\begin{aligned} \varvec{e}^{(l-1)} =&\sum _{t'=1}^{T}\sum _{t=1}^{T} \bigg [ \varvec{w}^{(l)\mathrm T} \bigg (\varvec{\overline{e}}^{(l)}\left[ t\right] \odot \varvec{s}^{(l)}\left[ t'\right] \odot \varvec{s}^{(l)}\left[ t\right] \bigg ) \nonumber \\&\odot \varvec{C}^{(l)}\left[ t'\right] \bigg ] \odot \hat{\varvec{t}}^{(l-1)'}. \end{aligned}$$
(B31)
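The Hadamard-product identity invoked in this step is easy to verify numerically. A minimal NumPy check with illustrative shapes (not part of the paper) follows.

import numpy as np

rng = np.random.default_rng(1)
n, m = 5, 7
w = rng.normal(size=(n, m))
a = rng.normal(size=n)
b = rng.normal(size=m)
c = rng.normal(size=m)

lhs = (w * np.outer(a, b)) @ c          # (w ⊙ a b^T) c
rhs = (w @ (b * c)) * a                 # [w (b ⊙ c)] ⊙ a
assert np.allclose(lhs, rhs)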
Using Theorem 3, we have
$$\begin{aligned} \varvec{e}^{(l-1)} =&\sum _{t'=1}^{T}\left[ \varvec{w}^{(l)\mathrm T}\left( \varvec{\overline{e}}^{(l)}\left[ t'\right] \odot \varvec{s}^{(l)}\left[ t'\right] \right) \odot \varvec{C}^{(l)}\left[ t'\right] \right] \nonumber \\&\odot \hat{\varvec{t}}^{(l-1)'}\text {.} \end{aligned}$$
(B32)
Consider the following equations:
$$\begin{aligned}\nonumber \dfrac{\partial \hat{t}_i^{(l)}}{\partial u_{i}^{(l)}}\left[ t\right]&= A_i^{(l)}\left[ t\right] \delta \left[ t-\hat{t}_i^{(l)}\right] ,\nonumber \\ A_i^{(l)}\left[ t\right]&= \left( u_{i,m}^{(l)}\left[ t\right] /\tau _m - u_{i,s}^{(l)}\left[ t\right] /\tau _s\right) ^{-1},\nonumber \end{aligned}$$
The gradient \(\hat{\varvec{t}}^{(l-1)'}\) on the right-hand side of Eq. (B32) is unfolded as
$$\begin{aligned} \hat{\varvec{t}}^{(l-1)'} = \sum _{t=1}^{T}\varvec{A}^{(l-1)}\left[ t\right] \odot \varvec{s}^{(l-1)}\left[ t\right] . \end{aligned}$$
(B33)
From Eqs. (B32) and (B33), we have
$$\begin{aligned} \varvec{e}^{(l-1)} =&\sum _{t=1}^{T} \sum _{t'=t}^{T}\bigg [\varvec{w}^{(l)\mathrm T} \left( \varvec{\overline{e}}^{(l)}\left[ t'\right] \odot \varvec{s}^{(l)}\left[ t'\right] \right) \nonumber \\&\odot \varvec{C}^{(l)}\left[ t'\right] \bigg ] \odot \varvec{A}^{(l-1)}\left[ t\right] \odot \varvec{s}^{(l-1)}\left[ t\right] . \end{aligned}$$
(B34)
Note that the lower limit of the summation over \(t'\) is set to t because \(\varvec{C}^{(l)}\left[ t'\right] \) in this equation becomes zero for any \(t'<t\) according to Theorem 2 (see the Heaviside step function). As such, \(\varvec{C}^{(l)}\left[ t'\right] =\left[ \partial v^{(l)}_1/\partial \hat{t}_1^{(l-1)}\left[ t'\right] ,\cdots , \partial v^{(l)}_M/\partial \hat{t}_M^{(l-1)}\left[ t'\right] \right] ^\textrm{T}\). If we leave the presynaptic spike timing \(\hat{t}_j^{(l-1)}\) as a variable t, the element becomes
$$\begin{aligned} \dfrac{\partial v_j^{(l)}}{\partial \hat{t}_j^{(l-1)}}\left[ t'\right] = C_\tau \left[ \tau _m^{-1}e^{- \left( t'-t\right) /\tau _m}-\tau _s^{-1}e^{- \left( t'-t\right) /\tau _s} \right] . \end{aligned}$$
(B35)
As shown in Eq. (B34), the variable t is the time argument of the presynaptic spike-stamp vector \(\varvec{s}^{(l-1)}\), so the only t with \(s_j^{(l-1)}\left[ t\right] =1\) is \(\hat{t}_j^{(l-1)}\), which renders Eq. (B35) equal to Eq. (B29). For clarity, we introduce a new vector \(\varvec{B}^{(l)}\left[ t,t'\right] \) whose elements are given by Eq. (B35). The product \(\varvec{\overline{e}}^{(l)}\left[ t'\right] \odot \varvec{s}^{(l)}\left[ t'\right] \) in Eq. (B34) is the layer-l error at timestep \(t'\), i.e., \(\varvec{\tilde{e}}^{(l)}\left[ t'\right] \). Therefore, we eventually have
$$\begin{aligned} \varvec{e}^{(l-1)} = \sum _{t=1}^{T}\varvec{\tilde{e}}^{(l-1)}\left[ t\right] , \end{aligned}$$
(B36)
where
$$\begin{aligned} \varvec{\tilde{e}}^{(l-1)}[t] =&\sum _{t'=t}^{T}\left( \varvec{w}^{(l)\mathrm T}\varvec{\tilde{e}}^{(l)} \left[ t'\right] \right) \nonumber \\&\odot \varvec{B}^{(l)}\left[ t,t'\right] \odot \varvec{A}^{(l-1)}\left[ t\right] \odot \varvec{s}^{(l-1)}\left[ t\right] . \end{aligned}$$
(B37)
\(\square \)
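To make the unfolded backward pass of Eqs. (B36) and (B37) concrete, it can be written as a short loop over timesteps. The NumPy sketch below is illustrative only: the shapes, the constant \(C_\tau \), and the inputs (the layer-l timestep errors, the read-out factors \(A^{(l-1)}[t]\), and the spike stamps \(s^{(l-1)}[t]\)) are placeholders rather than the authors' implementation.

import numpy as np

def kernel_B(t, tp, tau_m, tau_s, C_tau):
    # Element of B^{(l)}[t, t'] for t' >= t (zero otherwise, cf. the Heaviside factor)
    if tp < t:
        return 0.0
    return C_tau * (np.exp(-(tp - t) / tau_m) / tau_m
                    - np.exp(-(tp - t) / tau_s) / tau_s)

def backprop_errors(W, e_tilde_l, A_prev, s_prev, tau_m, tau_s, C_tau):
    """W: (M, N) weights of layer l; e_tilde_l: (T, M) timestep-wise errors of layer l;
    A_prev, s_prev: (T, N) read-out factors and spike stamps of layer l-1."""
    T, N = s_prev.shape
    e_tilde_prev = np.zeros((T, N))
    for t in range(T):
        for tp in range(t, T):           # lower limit t, as in Eq. (B37)
            e_tilde_prev[t] += (W.T @ e_tilde_l[tp]) \
                               * kernel_B(t, tp, tau_m, tau_s, C_tau) \
                               * A_prev[t] * s_prev[t]
    return e_tilde_prev.sum(axis=0)      # e^{(l-1)} = sum over t of e_tilde^{(l-1)}[t], Eq. (B36)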

Appendix C Proof of equivalence between folded and unfolded NOSONets

NOSONet can be unfolded on a computational graph to use the automatic differentiation framework [31]. To begin with, we define a spike-stamp vector at timestep t (\(\varvec{s}^{(l)}\left[ t\right] \)) such that its element is '1' if the corresponding NOSO spikes at that timestep, and '0' otherwise.
$$\begin{aligned}\nonumber \varvec{s}^{(l)}\left[ t\right] =\left[ \delta \left[ t-\hat{t}_1^{(l)}\right] ,\cdots , \delta \left[ t-\hat{t}_n^{(l)}\right] \right] ^\textrm{T}. \end{aligned}$$
Given that the variables \(u_{i,m}^{(l)}\) and \(u_{i,s}^{(l)}\) are continuous functions of time t, the gradient \(\partial \hat{t}_i^{(l)}/\partial u_i^{(l)}\) in Eq. (11) is the read-out of \(\left( u_{i,m}^{(l)}\left[ t\right] /\tau _m - u_{i,s}^{(l)}\left[ t\right] /\tau _s\right) ^{-1}\) at \(\hat{t}_i\). In this regard, Eq. (11) can be re-expressed as
$$\begin{aligned} \dfrac{\partial \hat{t}_i^{(l)}}{\partial u_{i}^{(l)}}\left[ t\right]&= A_i^{(l)}\left[ t\right] \delta \left[ t-\hat{t}_i^{(l)}\right] ,\nonumber \\ A_i^{(l)}\left[ t\right]&= \left( u_{i,m}^{(l)}\left[ t\right] /\tau _m - u_{i,s}^{(l)}\left[ t\right] /\tau _s\right) ^{-1}. \end{aligned}$$
(C38)
Therefore, the error \(\varvec{e}^{(l)}\) in Eq. (9) is re-expressed as the read-out of the variable \(\varvec{\overline{e}}^{(l)}\left[ t\right] \) (calculated at every timestep) upon spiking:
$$\begin{aligned} \varvec{e}^{(l)}&= \sum _{t=1}^{T}\varvec{\overline{e}}^{(l)}\left[ t\right] \odot \varvec{s}^{(l)}\left[ t\right] ,\nonumber \\ \varvec{\overline{e}}^{(l)}\left[ t\right]&= \nabla _{\varvec{\hat{t}}^{(l)}}\mathcal {L}\odot \varvec{A}^{(l)}\left[ t\right] . \end{aligned}$$
(C39)
Theorem 4
The weight update for the folded SNN,
$$\begin{aligned} \Delta \varvec{w}^{(l)} = -\eta diag\left( \varvec{e}^{(l)}\right) \varvec{v}^{(l)}\left[ \varvec{\hat{t}}^{(l)}\right] \text {,}\nonumber \end{aligned}$$
is equivalent to the following equation.
$$\begin{aligned} \Delta \varvec{w}^{(l)}&= -\eta \sum _{t=1}^{T}\left( \varvec{\overline{e}}^{(l)}\left[ t\right] \odot \varvec{s}^{(l)}\left[ t\right] \right) \varvec{v}^{(l)}\left[ t\right] ^\textrm{T},\nonumber \\ \varvec{v}^{(l)}[t]&= \left[ v_1^{(l)}[t],\dots , v_M^{(l)}[t]\right] ^\textrm{T}. \end{aligned}$$
(C40)
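During an unrolled simulation, Eq. (C40) amounts to accumulating an outer-product update only at timesteps where a postsynaptic NOSO spikes, since \(\varvec{s}^{(l)}[t]\) masks \(\varvec{\overline{e}}^{(l)}[t]\). A minimal NumPy sketch of this accumulation (placeholder shapes, not the released code) is:

import numpy as np

def delta_w(e_bar, s, v, eta):
    """e_bar, s: (T, M) timestep-wise errors and spike stamps of layer l;
    v: (T, D) with v^{(l)}[t] as defined above; returns the (M, D) weight update."""
    T = e_bar.shape[0]
    dw = np.zeros((e_bar.shape[1], v.shape[1]))
    for t in range(T):
        if s[t].any():                                # contributes only at spike timesteps
            dw += np.outer(e_bar[t] * s[t], v[t])
    return -eta * dw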
Theorem 5
The backward propagation of errors in aggregate
$$\begin{aligned} \varvec{e}^{(l-1)} = \left( \varvec{w}^{(l)\mathrm T}\odot \varvec{v}^{(l)'}\left[ \varvec{\hat{t}}^{(l)}\right] \right) \varvec{e}^{(l)}\odot \hat{\varvec{t}}^{(l-1)'}\nonumber \end{aligned}$$
is decomposed into timestep-wise errors \(\varvec{\tilde{e}}^{(l-1)}[t]\), each of which is calculated at every timestep, as follows:
$$\begin{aligned}\nonumber \varvec{e}^{(l-1)}&= \sum _{t=1}^{T}\varvec{\tilde{e}}^{(l-1)}[t],\nonumber \\ \varvec{\tilde{e}}^{(l-1)}[t]&= \sum _{t'=t}^{T} \left( \varvec{w}^{(l)\mathrm T}\varvec{\tilde{e}}^{(l)}\left[ t'\right] \right) \odot \varvec{B}^{(l)}\left[ t,t'\right] \nonumber \\&\qquad \odot \varvec{A}^{(l-1)}\left[ t\right] \odot \varvec{s}^{(l-1)}\left[ t\right] ,\nonumber \\ \varvec{B}^{(l)}\left[ t,t'\right]&= C_\tau \left[ \tau _m^{-1}e^{- \left( t'-t\right) /\tau _m}-\tau _s^{-1}e^{- \left( t'-t\right) /\tau _s}\right] \varvec{1}.\nonumber \end{aligned}$$
The all-one vector is denoted by \(\varvec{1}=\left[ 1,\dots , 1\right] ^\textrm{T}\).
Theorems 4 and 5 are proven in Appendix B. Theorem 5 identifies the backward propagation of errors at timestep \(t'\) toward timestep t through time. Thus, BPLC + NOSO can be unfolded on a computational graph as shown in Fig. 2, allowing the automatic differentiation framework to be used to learn the weights. Note that we rule out the backward pass from \(sav^{(l)}\left[ t+1\right] \) to \(\varvec{s}^{(l)}\left[ t\right] \) because it can be ignored if the learning uses spike-function gradients (rather than surrogate gradients) and refractory periods. This is proven in Appendix D.
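As one concrete way to realize this on the unrolled graph, the spike-function gradient of Eq. (C38) can be attached through a custom autograd function whose backward multiplies the upstream gradient by \(A^{(l)}[t]\) only at the spiking timestep. The PyTorch sketch below is a hedged illustration: the variable names (u, u_m, u_s), the symmetric-threshold test, and the omitted refractory bookkeeping are our assumptions, not the authors' released code.

import torch

class SpikeFn(torch.autograd.Function):
    """Heaviside spiking in the forward pass; an Eq. (C38)-style gradient in the backward pass."""

    @staticmethod
    def forward(ctx, u, u_m, u_s, theta, tau_m, tau_s):
        spike = (u.abs() >= theta).float()        # symmetric dual threshold (assumed form)
        A = 1.0 / (u_m / tau_m - u_s / tau_s)     # A^{(l)}[t] of Eq. (C38); add a small eps in practice
        ctx.save_for_backward(spike, A)
        return spike

    @staticmethod
    def backward(ctx, grad_out):
        spike, A = ctx.saved_tensors
        # Non-zero only at the timestep at which the NOSO spikes (the delta factor);
        # no gradients for the remaining (non-tensor) inputs.
        return grad_out * A * spike, None, None, None, None, None

# Usage inside the unrolled simulation loop (placeholder tensors):
# s_t = SpikeFn.apply(u_t, u_m_t, u_s_t, theta, tau_m, tau_s)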

Appendix D Gradient of the spike-availability function with respect to a spike from the previous timestep

Unlike surrogate gradients, spike-function gradients are non-zero only at timesteps at which the neuron spikes. Moreover, because of the refractory period, the same neuron cannot spike at two consecutive timesteps. Consider the computational graph in Fig. 2. If the ith neuron in the lth layer is quiet at timestep \(t+1\), the gradient \(\partial \hat{t}_i^{(l)}/\partial u_i^{(l)}\left[ t+1\right] \) is zero, so no gradient flows to \(s_i^{(l)}\left[ t\right] \) regardless of the presence of the backward pass. If the neuron is active at timestep \(t+1\) (and hence quiet at timestep t), the gradient \(\partial \hat{t}_i^{(l)}/\partial u_i^{(l)}\left[ t+1\right] \) is non-zero; however, the gradient at timestep t is zero, so the presence or absence of the backward pass still does not affect any gradient flow.
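The argument can be illustrated with a toy check (assumed spike trains, not data from the paper): with a refractory period of at least one timestep, \(s_i^{(l)}[t]\,s_i^{(l)}[t+1]=0\) for every neuron and timestep, so whatever would flow along the omitted backward pass is always multiplied by zero.

import numpy as np

# two NOSOs over four timesteps, each spiking once at most (rows: neurons, columns: timesteps)
s = np.array([[0, 1, 0, 0],
              [0, 0, 0, 1]])
# refractory period: no neuron spikes at two consecutive timesteps
assert np.all(s[:, 1:] * s[:, :-1] == 0)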

Appendix E Hyperparameters

We used the hyperparameters listed in Table 3. The input scaling factor is the upper limit of the scaled pixel values of the input image. We initialized the kernels and weight matrices using the Xavier uniform initialization method given by
$$\begin{aligned} W \sim U\left( -\sqrt{\dfrac{a}{n_{\text {in}} + n_{\text {out}}}}, \sqrt{\dfrac{a}{n_{\text {in}} + n_{\text {out}}}}\right) \text {,} \end{aligned}$$
where a is set to 6 by default. The parameters of NOSONet (32C5-MP2-64C5-MP2-600) on Fashion-MNIST were initialized with this method. NOSONet (64C5-128C5-MP2-256C5-MP2-512C5-256C5-1024-512) on CIFAR-10 was initialized in the same way, except that the weight matrices of the fully connected layers used a modified Xavier uniform method with \(a=3\) rather than 6.
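For reference, the initializer above can be written in a few lines. The PyTorch sketch below (with an assumed helper name) is our reading of the formula, not code from the paper; a = 6 reproduces the standard Xavier uniform bound, and a = 3 gives the narrower range used for the CIFAR-10 fully connected layers.

import torch

def xavier_uniform_a_(weight: torch.Tensor, n_in: int, n_out: int, a: float = 6.0):
    # W ~ U(-sqrt(a / (n_in + n_out)), +sqrt(a / (n_in + n_out)))
    bound = (a / (n_in + n_out)) ** 0.5
    with torch.no_grad():
        return weight.uniform_(-bound, bound)

fc = torch.nn.Linear(1024, 512, bias=False)                 # e.g., an FC layer of the CIFAR-10 NOSONet
xavier_uniform_a_(fc.weight, n_in=1024, n_out=512, a=3.0)   # a = 3 for the CIFAR-10 FC layers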
Table 3  Hyperparameters used

Parameter                                         F-MNIST                CIFAR-10
Timestep                                          1 ms                   1 ms
Spiking threshold \(\vartheta \)                  ±0.15 mV               ±0.15 mV
Optimizer                                         SGD                    SGD
Input scaling factor                              0.3                    0.3
Membrane potential time constant \(\tau _{m}\)    160 ms                 165 ms
Synaptic current time constant \(\tau _{s}\)      40 ms                  50 ms
# Epochs                                          80                     120
Batch size                                        64                     32
Initial learning rate                             5E-3                   1E-2
Plateau learning rate                             5E-2
Learning rate decay                               0.1                    0.1
Decay interval                                    50 epochs              100 epochs
Weight decay rate (L2 regularization)             5E-3                   1E-3
# Timesteps                                       100                    100
Initialization                                    Xavier uniform [14]    Xavier uniform [14]
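For orientation, the CIFAR-10 optimizer settings in Table 3 map onto a standard SGD setup with step decay. The PyTorch sketch below uses a stand-in model and a dummy loss, covers only the listed step decay (not the plateau phase), and should be read as an assumption-laden illustration rather than the training script.

import torch

model = torch.nn.Linear(8, 4)                                                      # stand-in for NOSONet
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=1e-3)        # CIFAR-10 row of Table 3
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.1)   # decay 0.1 every 100 epochs

for epoch in range(120):                               # # Epochs (CIFAR-10)
    optimizer.zero_grad()
    loss = model(torch.randn(32, 8)).pow(2).mean()     # dummy batch of size 32
    loss.backward()
    optimizer.step()
    scheduler.step()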

Appendix F Potential distribution over samples in a batch

Figures 7 and 8 show the potential distributions over samples in a random batch (batch size: 300) for the single-threshold and dual-threshold cases, respectively. Note that the distributions exclude zero potential.
References
2. Bellec G, Salaj D, Subramoney A, et al (2018) Long short-term memory and learning-to-learn in networks of spiking neurons. In: Advances in Neural Information Processing Systems, vol 31. Curran Associates, Inc
13. Gerstner W, Kistler WM (2002) Spiking neuron models: single neurons, populations, plasticity. Cambridge University Press, Cambridge
14. Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp 249–256
15. Hebb DO (1949) The organization of behavior. Wiley, New York
18. Kim G, Jeong DS (2021) CBP: backpropagation with constraint on weight precision using a pseudo-Lagrange multiplier method. In: Advances in Neural Information Processing Systems, vol 34. Curran Associates, Inc
19. Kim J, Kim K, Kim JJ (2020a) Unifying activation- and timing-based learning rules for spiking neural networks. In: Advances in Neural Information Processing Systems, vol 33. Curran Associates, Inc
31. Paszke A, Gross S, Massa F, et al (2019) PyTorch: an imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems, vol 32. Curran Associates, Inc
34. Shrestha SB, Orchard G (2018) SLAYER: spike layer error reassignment in time. In: Advances in Neural Information Processing Systems, vol 31. Curran Associates, Inc
47. Zhang W, Li P (2019) Spike-train level backpropagation for training deep recurrent spiking neural networks. In: Advances in Neural Information Processing Systems, vol 32. Curran Associates, Inc
48. Zhang W, Li P (2020) Temporal spike sequence learning via backpropagation for deep spiking neural networks. In: Advances in Neural Information Processing Systems, vol 33. Curran Associates, Inc