24.07.2019  Issue 5/2019 Open Access
Biological Neuron Coding Inspired Binary Word Embeddings
Journal:
Cognitive Computation > Issue 5/2019
Important Notes
Yuwei Wang and Yi Zeng contributed equally to this work and should be regarded as co-first authors.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Introduction
Word embedding models, for example Word2Vec [1] and GloVe [2], can convert both semantic and syntactic information of words into dense vectors. Recently, they have attracted a lot of attention due to their good performance in various natural language processing tasks, such as language modeling [3], parsing [4], sentence classification [5], and machine translation [6].
However, these dense representations are mostly derived from the statistical properties of large corpora, and the individual dimensions of the word vectors lack interpretability. Several works have tried to transform dense word embeddings into sparse ones to improve interpretability. Murphy et al. introduced a matrix factorization algorithm named non-negative sparse embeddings (NNSE), applied to the co-occurrence matrix, to obtain sparse, effective, and interpretable embeddings [7]. Faruqui et al. defined an \(L_{1}\)-regularized objective function and proposed a postprocessing optimization algorithm that converts the original dense embeddings into sparse or binary embeddings, which they call sparse or binary overcomplete word vectors [8]. Sun et al. introduced an algorithm that obtains sparse embeddings while training the Word2Vec model, using an \(L_{1}\) regularizer on the cost function and the regularized dual averaging optimization algorithm [9]. For binary word embeddings, there are also rounding algorithms that convert dense vectors into discrete integer values to reduce memory. Ling et al. proposed postprocessing rounding, stochastic rounding, and auxiliary update vector algorithms for word embeddings under limited memory, which they name truncated word embeddings [10]. These works mention the interpretability issue but do not demonstrate it clearly. In this paper, we aim to improve it via a brain-inspired approach that explains each dimension of the word embeddings in terms of neuron coding models.
In biological brains, information in areas such as the inferior temporal visual cortex, hippocampus, orbitofrontal cortex, and insula is encoded with sparse distributed representations [11]. Much experimental evidence indicates that biological neural systems use the timing of spikes to encode information [12–14]. The spike trains of cell activity during information transmission inspire us to combine traditional word embeddings and neuron coding models into binary embeddings. In this paper, we perform postprocessing operations on the original dense word embeddings to obtain binary ones, with inspiration from biological neuron coding models. The proposed binary embeddings occupy less space and are more interpretable than previous models.
Related Works
Neuron Coding
Neuron coding is concerned with describing the relationship between the stimulus and the neuronal responses [15]. Considerable effort has been dedicated to developing techniques that record the brain's electrical activity at different spatial scales, such as single-cell spike train recording, local field potential (LFP), and electroencephalogram (EEG) [16]. Neuron coding models mainly concern how neurons encode, transmit, and decode information. Their main focus is to understand how neurons respond to a wide variety of stimuli and to construct models that attempt to predict responses to other stimuli.
Neurons propagate signals by generating electrical pulses called action potentials: voltage spikes that can travel down nerve fibers. For example, in the presence of external sensory stimuli such as light, sound, taste, smell, and touch, sensory neurons change their activity by firing sequences of action potentials in various temporal patterns [16]. It is known that information about the stimulus is encoded in action potentials and transmitted through connected neurons in our brains.
There are various hypotheses on neuron coding based on recent neurophysiological findings on the biological nervous system, mainly spike rate coding and spike time coding. Spike rate coding considers only the firing rate in an interval as the measure of the information carried. Rate coding was first motivated by the observation of frog cutaneous receptors by Adrian et al. in 1926 that physiological neurons tend to fire more often for stronger stimuli [17]. Spike rate coding has been the main paradigm in artificial neural networks, such as sigmoidal neurons. Meanwhile, Poisson-like rate coding is widely used by physiologists to describe how neurons transmit information. Recently, some neurophysiological results have shown that, in some biological neural systems, efficient information processing is more likely based on the precise timing of action potentials than on the firing rate [18–20]. Timing coding hypotheses [21] mostly concentrate on the timing of individual spikes; typical examples are time to first spike [22, 23], rank order coding [20, 24], latency coding [25], and phase coding [26].
In our study, we use Poisson-like coding for spike rate coding and various spiking neuron models for time coding. We apply these biological neuron coding hypotheses to build binary word embedding models.
Spiking Neural Network Models
Spiking neural networks (SNNs), which are strongly inspired by recent advances in neuroscience, are often referred to as the third generation of neural network models [27]. Unlike traditional neural networks, SNNs treat the timing of individual spikes as the means of communication and neural computation [21].
Spiking neuron models, which are the basis of SNNs, describe the properties of certain cells in the nervous system that generate spikes across their cell membrane. The best-known neuron model is the Hodgkin-Huxley model (HH model). In 1952, Hodgkin and Huxley performed experiments on the giant axon of the squid with the voltage clamp technique, which punctures the cell membrane and makes it possible to impose a specific membrane voltage or current [28]. The model was derived from these recordings and the fitted results, and it describes well the changes of the ion channels and the behavior of the neuron after stimulation.
In the HH model [29], the semipermeable cell membrane separates the interior of the cell from the extracellular liquid and acts as a capacitor. Because of the active transport of ions through the cell membrane, the ion concentration inside the cell differs from that in the extracellular liquid. The Nernst potential generated by the difference in ion concentration is represented by a battery.
The model takes three types of channels into consideration: a sodium channel, a potassium channel, and an unspecific leakage channel with resistance \(R\). From the definition of capacitance \(C = Q/v\), where \(Q\) is the charge and \(v\) is the voltage across the capacitor, the charging current is \(C \cdot dv/dt\), thus:
$$ C\cdot \frac{dv}{dt} = -\sum\limits_{k} I_{k}(t)+I(t) $$
(1)
where \(I(t)\) is the applied current and the sum runs over the ionic currents \(I_{k}\). The leakage channel is described by a voltage-independent conductance \(g_{L} = 1/R\). When all sodium or potassium channels are open, they transmit currents with a maximum conductance \(g_{Na}\) or \(g_{K}\), respectively. However, the channels are not always open; the probability that a channel is open is described by the additional variables \(m\), \(n\), and \(h\). The combined action of \(m\) and \(h\) controls the \(Na^{+}\) channels, while the \(K^{+}\) gates are controlled by \(n\):
$$ \sum\limits_{k} I_{k} = g_{Na}m^{3} h (v-E_{Na} )+g_{K}n^{4} (v-E_{K} ) + g_{L} (v-E_{L} ) $$
(2)
The parameters \(E_{Na}\), \(E_{K}\), and \(E_{L}\) are empirical parameters, and the gating variables \(m\), \(n\), and \(h\) are defined by differential equations [28].
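As a small illustration of Eqs. 1 and 2, the following sketch evaluates the summed ionic current and the resulting rate of change of the membrane potential for given gating values. The function names `ionic_current` and `dv_dt` are hypothetical, and the numeric values in the usage lines are placeholders chosen for illustration only; they are not parameters reported in this paper. The gating dynamics of \(m\), \(n\), and \(h\) are not modeled here.

```python
def ionic_current(v, m, n, h, g_na, g_k, g_l, e_na, e_k, e_l):
    """Sum of the sodium, potassium, and leak currents (Eq. 2)."""
    i_na = g_na * m**3 * h * (v - e_na)  # sodium channel, gated by m and h
    i_k = g_k * n**4 * (v - e_k)         # potassium channel, gated by n
    i_l = g_l * (v - e_l)                # voltage-independent leak
    return i_na + i_k + i_l


def dv_dt(v, m, n, h, i_ext, c, **channel):
    """Right-hand side of Eq. 1 divided by C: (-sum_k I_k + I) / C."""
    return (-ionic_current(v, m, n, h, **channel) + i_ext) / c


# Placeholder channel parameters (illustrative assumption, not from the paper).
channel = dict(g_na=120.0, g_k=36.0, g_l=0.3, e_na=50.0, e_k=-77.0, e_l=-65.0)

# With all gates closed and v at the leak reversal potential, only the
# applied current drives the membrane potential.
print(dv_dt(-65.0, 0.0, 0.0, 0.0, i_ext=2.0, c=1.0, **channel))
```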
In addition to the HH model, other types of spiking neuron models have been proposed, such as integrate-and-fire models and their variants, Izhikevich's neuron model, and the spike response model (SRM). Recently, SNN-based models have been applied in various AI applications, such as character recognition [30, 31], object recognition [32], image segmentation [33], speech recognition [34], robotics [35], knowledge representation [36], and symbolic reasoning [37]. In this paper, we use the leaky integrate-and-fire model and Izhikevich's neuron model to convert word embeddings into more explainable binary embeddings.
Word Embedding Models Based on Inspirations from Biological Neuron Coding
The Framework
We build unsupervised models that produce binary word embeddings by postprocessing, based on two types of brain-inspired models: the homogeneous Poisson process and spiking neural networks. Starting from pretrained word embeddings such as Word2Vec and GloVe, these models convert the original dense embeddings into binary form. Unlike traditional works on binary word representations, our models are inspired by neuroscience, which makes them biologically plausible and more interpretable.
To mimic information transmission in biological brains, we take temporal information into consideration. As Fig. 1 shows, our models combine the original dense word embeddings with neural coding algorithms to obtain the spiking times of neurons during a given period of time. We denote the original dense word embedding matrix as \(W\) with elements \(w_{id}\), where \(i = 1,2,\cdots,N\) and \(d = 1,2,\cdots,D\); \(N\) is the total number of words and \(D\) is the number of dimensions of each word. For each word, we build a neural model based on the value of each dimension. During a given time \(T\), we record the membrane potential of each neuron every \({\varDelta} t\), via the neural coding algorithms described in “Homogeneous Poisson Process-Based Binary Word Embeddings” and “Spiking Neural Networks Based Binary Word Embeddings.” Then the spiking time matrix \(S^{(i)}\), which contains the spiking times of all neurons for the \(i\)th word, is flattened into a vector \(\mathbf{f}^{(i)}\) by concatenating its rows head to tail. The dimension of \(\mathbf{f}^{(i)}\) is \(D\times (T/{\varDelta} t)\). Finally, to make our model more robust, we introduce the tolerance factor \(tol\). We allow a window of \(tol \times {\varDelta} t\) to generate one binary bit, and obtain the binary word embeddings in the following way:
$$ \begin{array}{@{}rcl@{}} \mathbf{b}^{(i)} &=& [\mathcal{T}(\mathbf{f}^{(i)}_{1:1*tol}),\mathcal{T}(\mathbf{f}^{(i)}_{1*tol + 1: 2*tol}), \cdots ,\mathcal{T}(\mathbf{f}^{(i)}_{(k-1)*tol+1:k*tol}),\\ &&\cdots , \mathcal{T}(\mathbf{f}^{(i)}_{D\times (T/ {\varDelta} t) - tol + 1:D\times (T/ {\varDelta} t)})] \end{array} $$
(3)
The \(\mathcal {T}(vector)\) operation returns 1 if the vector contains at least one 1, and 0 otherwise.
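The windowing in Eq. 3 can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `tolerance_binarize` is hypothetical, the spike matrix is assumed to be stored as a \(D \times (T/\varDelta t)\) array of 0s and 1s, and the flattened length is assumed to be a multiple of \(tol\).

```python
import numpy as np


def tolerance_binarize(spikes, tol):
    """Collapse a D x (T/dt) 0/1 spike matrix into a binary code (Eq. 3)."""
    # Flatten row by row, i.e., concatenate the rows head to tail.
    flat = np.asarray(spikes, dtype=np.uint8).reshape(-1)
    if flat.size % tol != 0:
        raise ValueError("flattened length must be a multiple of tol")
    windows = flat.reshape(-1, tol)  # one row per window of tol time bins
    # T(.): a window yields 1 if it contains at least one spike.
    return windows.any(axis=1).astype(np.uint8)


spikes = np.array([[0, 0, 1, 0, 0, 0],
                   [0, 0, 0, 0, 1, 1]])
print(tolerance_binarize(spikes, tol=3))  # -> [1 0 0 1]
```

With \(D = 300\), \(T/\varDelta t = 100\), and \(tol = 10\), the same function yields a code of 3000 bits per word.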
Homogeneous Poisson Process-Based Binary Word Embeddings
Poisson-like rate coding is a major approach to simulating spiking responses to stimuli. Biological recordings from the medial temporal [38, 39] and primary visual cortex [40] of macaque monkeys provide good evidence for Poisson process-based coding.
The homogeneous Poisson process assumes that the current spike does not depend at all on preceding spikes and that the instantaneous firing rate \(r\) is constant over time. Suppose we are given an interval \((0, T)\) and place a single spike in it at random. If we pick a subinterval \((t, t + {\varDelta} t)\) of length \({\varDelta} t\), the probability that the spike occurs in the subinterval equals \({\varDelta} t/T\). When we place \(k\) spikes in \((0, T)\), according to the binomial formula, the probability that \(n\) of them fall in \((t, t + {\varDelta} t)\) is:
$$ P\{n \ spikes \ during \ {\varDelta} t \} = \frac{k!}{(k-n)!n!}({\varDelta} t /T)^{n}(1-{\varDelta} t /T)^{k-n} $$
(4)
Keeping the firing rate \(r = k/T\) constant, we increase \(k\) and \(T\) together. As \(k \to \infty\), the probability becomes:
$$ P\{n \ spikes \ during \ {\varDelta} t \} = \frac{(r{\varDelta} t)^{n}}{n!}e^{-r{\varDelta} t} $$
(5)
This is the probability mass function of the Poisson distribution.
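The limit from Eq. 4 to Eq. 5 can be checked numerically. The sketch below uses only the Python standard library; the function names and the values of \(r\), \({\varDelta} t\), and \(n\) are illustrative assumptions, not settings from the experiments.

```python
import math


def binomial_prob(n, k, dt, T):
    """Eq. 4: probability that n of k uniformly placed spikes fall in a window dt."""
    p = dt / T
    return math.comb(k, n) * p**n * (1 - p) ** (k - n)


def poisson_prob(n, r, dt):
    """Eq. 5: Poisson probability of n spikes in a window dt at rate r."""
    return (r * dt) ** n / math.factorial(n) * math.exp(-r * dt)


r, dt, n = 50.0, 0.01, 2  # illustrative values
for k in (10, 100, 10_000):
    T = k / r  # grow k and T together so that r = k / T stays fixed
    print(k, binomial_prob(n, k, dt, T), poisson_prob(n, r, dt))
```

As \(k\) grows, the first printed probability approaches the second, which does not depend on \(k\).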
In our homogeneous Poisson process-based binary word embedding model, we treat each dimension as an independent homogeneous Poisson process and use the normalized value of the dimension, \(w_{id}^{normalized}\), as the constant firing rate. In the spike generator, for each \({\varDelta} t\) in the interval \((0, T)\), we compare \(w_{id}^{normalized}\cdot {\varDelta} t\) with a random variable \(x_{random}\). We then obtain the spiking time matrix in this way:
$$ w_{id}^{normalized}\cdot {\varDelta} t = \left\{\begin{array}{l} > x_{random} \ \ fire \ a \ spike \\ \leq x_{random} \ \ nothing \end{array}\right. $$
(6)
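A minimal sketch of the spike generator in Eq. 6 follows. It assumes that \(x_{random}\) is drawn uniformly from \([0, 1)\) independently for every neuron and time bin, which is a common choice but is not stated in the text; the function name `poisson_spikes` and the example rates are hypothetical, and the normalization of the embedding values is left to the caller.

```python
import numpy as np


def poisson_spikes(rates, T, dt, rng):
    """Return a D x (T/dt) 0/1 matrix; rates[d] is the constant rate of neuron d."""
    rates = np.asarray(rates, dtype=float)
    n_steps = int(round(T / dt))
    # Assumption: x_random ~ Uniform[0, 1), one draw per neuron and time bin.
    x_random = rng.random((rates.size, n_steps))
    # Eq. 6: fire a spike where rate * dt exceeds the random draw.
    return (rates[:, None] * dt > x_random).astype(np.uint8)


rng = np.random.default_rng(0)
S = poisson_spikes([0.0, 0.5, 1.0], T=10.0, dt=0.1, rng=rng)
print(S.shape, S.sum(axis=1))
```

Because the draws are random, different seeds give different codes, which is consistent with reporting a mean and standard deviation over several runs.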
Spiking Neural Networks Based Binary Word Embeddings
The LIF-Based Binary Word Embedding Model
The leaky integrate-and-fire (LIF) neuron model, a simplified version of the HH model, is one of the simplest spiking neuron models [41]. The LIF model is widely used because it is biologically realistic and computationally simple to analyze and simulate [31, 42, 43].
In the LIF model, as Eq. 7 shows, \(v\) is the membrane potential, \(\tau_{m}\) is the membrane time constant, \(R\) is the membrane resistance, \(v_{th}\) is the firing threshold, and \(v_{r}\) is the reset potential. For the LIF-based word embedding model, we replace the input current \(I\) with the product of the \(d\)th dimension value of the \(i\)th word and a current boost factor \(I_{boost}\):
$$ \tau_{m} \frac{dv}{dt} = -v(t)+R\cdot I_{boost} \cdot w_{id}, \ \ if \ v(t) > v_{th}, \ v(t)\leftarrow v_{r} $$
(7)
In our LIF-based binary word embedding model, we regard the value \(I_{boost} \cdot w_{id}\) as the current intensity for the neurons, and we obtain the spiking time matrix from the record of the membrane potential \(v\). In addition, we also try adding white noise to the current to improve robustness.
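Eq. 7 can be simulated with a simple forward Euler scheme, as in the sketch below. The defaults \(T = 10\), \({\varDelta} t = 0.1\), \(\tau_{m} = 10\), and \(v_{th} = 15\) follow the values given in the experiment section, while the membrane resistance (`resistance=1.0`), the reset potential (`v_reset=0.0`), the Euler integration itself, and the function name `lif_spikes` are assumptions made for illustration. The noise variant is not included.

```python
import numpy as np


def lif_spikes(w, i_boost, T=10.0, dt=0.1, tau_m=10.0, v_th=15.0,
               resistance=1.0, v_reset=0.0):
    """Simulate one LIF neuron per embedding dimension; returns D x (T/dt) spikes."""
    w = np.asarray(w, dtype=float)  # 1-D vector of embedding values
    n_steps = int(round(T / dt))
    v = np.full(w.shape, v_reset)   # assumed initial potential: the reset value
    spikes = np.zeros((w.size, n_steps), dtype=np.uint8)
    drive = resistance * i_boost * w  # R * I_boost * w_id, constant in time
    for step in range(n_steps):
        v = v + dt * (-v + drive) / tau_m  # forward Euler step of Eq. 7
        fired = v > v_th
        spikes[fired, step] = 1
        v[fired] = v_reset                 # reset after a spike
    return spikes


S = lif_spikes([0.0, 0.1, 1.0], i_boost=100.0)
print(S.sum(axis=1))  # spike counts per dimension
```

In this sketch, a dimension whose constant drive stays below \(v_{th}\) never fires, so larger embedding values translate into earlier and more frequent spikes.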
The Izhikevich Neuron-Based Binary Word Embedding Model
The Izhikevich neuron model is not only capable of producing the rich firing patterns exhibited by real biological neurons but is also computationally simple [44]. The model makes use of bifurcation methodologies [45] to reduce the more biophysically accurate HH neuron model to a simple one of the following form:
$$ \frac{dv}{dt} = 0.04v(t)^{2} + 5v(t)+140-u(t)+I, \ \ \frac{du}{dt} = a(bv(t)-u(t)) $$
(8)
with the after-spike reset: if \(v(t) \geq v_{th}\), then \(v(t) \leftarrow c\) and \(u(t) \leftarrow u(t) + d\).
In the Izhikevich neuron model, \(v\), \(v_{th}\), and \(v_{r}\) have the same meaning as in the LIF model, \(u\) represents the membrane recovery variable, and \(a\), \(b\), \(c\), and \(d\) are four important hyperparameters. The parameter \(a\) describes the time scale of \(u\), and \(b\) describes the sensitivity of \(u\) to the subthreshold fluctuations of \(v\). The parameter \(c\) describes the after-spike reset value of \(v\), which is caused by fast high-threshold \(K^{+}\) conductances, and \(d\) describes the after-spike reset of \(u\), which is caused by slow high-threshold \(Na^{+}\) and \(K^{+}\) conductances.
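The dynamics of Eq. 8 and the reset rule can be sketched for a single neuron as follows. The \((a, b, c, d)\) values for regular spiking (RS) and fast spiking (FS) cells are the ones published by Izhikevich for his simple model; the forward Euler integration, the initial state \(v = c\), \(u = b \cdot c\), and the function name `izhikevich_spikes` are assumptions for illustration and may differ from the authors' implementation.

```python
import numpy as np

# (a, b, c, d) for two cortical cell types, as published for the simple model.
PARAMS = {"RS": (0.02, 0.2, -65.0, 8.0), "FS": (0.1, 0.2, -65.0, 2.0)}


def izhikevich_spikes(current, kind="RS", T=10.0, dt=0.1, v_th=30.0):
    """Simulate one Izhikevich neuron driven by a constant current."""
    a, b, c, d = PARAMS[kind]
    n_steps = int(round(T / dt))
    v, u = c, b * c  # assumed initial state
    spikes = np.zeros(n_steps, dtype=np.uint8)
    for step in range(n_steps):
        dv = 0.04 * v * v + 5.0 * v + 140.0 - u + current  # Eq. 8, first part
        du = a * (b * v - u)                               # Eq. 8, second part
        v, u = v + dt * dv, u + dt * du                    # forward Euler step
        if v >= v_th:
            spikes[step] = 1
            v, u = c, u + d  # after-spike reset
    return spikes


# The input current would be I_boost * w_id in the embedding model.
for kind in ("RS", "FS"):
    print(kind, int(izhikevich_spikes(100.0, kind).sum()))
```

Different parameter sets produce different spike trains for the same input current, which is what allows excitatory and inhibitory submodels to encode the same embedding value differently.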
As Izhikevich et al. [44] show, different choices of these four parameters can simulate different types of neurons in mammalian brains, such as excitatory cortical cells, inhibitory cortical cells, and thalamocortical cells. In this paper, we mainly focus on excitatory and inhibitory cortical neurons. According to intracellular recordings, cortical cells can be divided into different types: for example, regular spiking (RS), intrinsically bursting (IB), and chattering (CH) for excitatory neurons, and fast spiking (FS) and low-threshold spiking (LTS) for inhibitory neurons.
In our Izhikevich neuron model-based binary word embedding models, we combine excitatory and inhibitory neurons at a ratio of 4:1, which is motivated by the ratio in the mammalian cortex [44]. As mentioned before, for each word we set up \(D\) neurons and regard the product of the original word embedding value \(w_{id}\) and a factor \(I_{boost}\) as the current for the model. We assign each neuron to an excitatory or inhibitory submodel, and for each dimension of a word we obtain the spike times from the corresponding submodel.
Experiment Validations
Validation Tasks and Datasets
We evaluate our binary embeddings on word similarity and text classification tasks. The word similarity task has been widely used to measure the degree to which word embeddings capture the similarity between two words, while the text classification task is a traditional NLP application. In our experiments, all the binary word embedding models are based on two widely accepted kinds of original word embeddings, namely Word2Vec [1] and GloVe [2].
For the word similarity task, we find similar words via Hamming distance, which is faster to compute than the traditional cosine distance used for dense embeddings. We evaluate the embeddings on three public datasets: (1) WordSim-353, the most widely used dataset for word similarity tests, consisting of 353 pairs of words [46]; (2) SimLex-999, which consists of 999 pairs of words and measures how well word embeddings capture similarity, rather than relatedness or association [47]; (3) Rare Words, which consists of 2,034 word pairs proposed by Luong et al. [48] and focuses on rare words to complement existing datasets. All these word pairs come with human-assigned similarity scores, and we compute Spearman's rank correlation coefficient between the ranking produced by the word embeddings and the human-labeled ranking.
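The evaluation described above can be sketched as follows. This is an illustrative outline under stated assumptions: the similarity score is taken as the fraction of matching bits (one of several ways to turn a Hamming distance into a similarity), ties in the ranking are handled with average ranks, and the function names are hypothetical.

```python
import numpy as np


def hamming_similarity(b1, b2):
    """Fraction of matching bits between two equal-length 0/1 vectors."""
    b1, b2 = np.asarray(b1), np.asarray(b2)
    return 1.0 - np.count_nonzero(b1 != b2) / b1.size


def average_ranks(x):
    """Ranks starting at 1; tied values share the mean of their positions."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    sorted_x = x[order]
    ranks = np.empty(x.size, dtype=float)
    i = 0
    while i < x.size:
        j = i
        while j + 1 < x.size and sorted_x[j + 1] == sorted_x[i]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def spearman(a, b):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return float(np.corrcoef(average_ranks(a), average_ranks(b))[0, 1])


# Toy example: model scores for three word pairs vs. human scores.
model_scores = [hamming_similarity([1, 0, 1, 1], [1, 0, 1, 0]),
                hamming_similarity([1, 0, 1, 1], [0, 1, 1, 0]),
                hamming_similarity([1, 0, 1, 1], [0, 1, 0, 0])]
human_scores = [9.0, 4.5, 1.0]
print(spearman(model_scores, human_scores))
```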
For the text classification task, we apply the OR operation to the binary embeddings of the words to generate the representation of a text, and we use the k-nearest neighbors (kNN) classifier to measure accuracy. We validate our algorithms on two public text datasets: (1) Search Snippets, a short text dataset collected by Phan et al. [50], which is selected from the results of Web search transactions using predefined phrases from 8 different domains; (2) Sentiment Analysis, proposed by Socher et al. [49], a treebank of sentences from movie reviews annotated with sentiment labels. The sentences in the treebank are split into train (8544), dev (1101), and test (2210) sets. We merge the train and dev parts for the kNN classifier and ignore neutral sentences, analyzing performance on only the positive and negative classes.
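The text representation and classifier described above can be sketched as follows. The OR aggregation follows the text; the use of Hamming distance inside the kNN classifier, the majority vote, and the function names `text_code` and `knn_predict` are assumptions for illustration, since the distance metric and the value of k are not specified here.

```python
import numpy as np


def text_code(word_codes):
    """Bitwise OR of the binary codes of all words in a text."""
    return np.bitwise_or.reduce(np.asarray(word_codes, dtype=np.uint8), axis=0)


def knn_predict(query, train_codes, train_labels, k=1):
    """Majority label among the k training texts closest in Hamming distance."""
    dists = np.count_nonzero(np.asarray(train_codes) != query, axis=1)
    nearest = np.argsort(dists, kind="stable")[:k]
    labels, counts = np.unique(np.asarray(train_labels)[nearest],
                               return_counts=True)
    return labels[np.argmax(counts)]


# Toy example with 4-bit word codes.
train = [text_code([[1, 0, 0, 0], [0, 1, 0, 0]]),  # text labeled "pos"
         text_code([[0, 0, 1, 0], [0, 0, 0, 1]])]  # text labeled "neg"
query = text_code([[1, 0, 0, 0], [1, 1, 0, 0]])
print(knn_predict(query, train, ["pos", "neg"], k=1))  # -> pos
```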
Experiment Details and Results
In our experiments, we use the pretrained GloVe and Word2Vec embeddings, both of which have 300 dimensions. We compare against three baselines: the original embeddings and two binary embeddings, “OvercompleteB,” which derives from the work of Faruqui et al. [8], and “Rude Binarization,” which converts the original embeddings into binary ones via a simple sign function.
For all the biological neuron coding-inspired models, we set the interval \(T = 10\) ms and the subinterval \({\varDelta} t = 0.1\) ms. We find the best hyperparameters through grid search on the word similarity tasks and apply them to both experimental tasks. For the Poisson, LIF, and LIF with noise models, \(tol\) is 5, while for the other models \(tol\) is 10. For the LIF and LIF with noise models, \(\tau_{m} = 10\) and \(v_{th} = 15\), while for the Izhikevich model \(v_{th} = 30\), and the other parameters follow [44] for the different submodels. The \(I_{boost}\) factors are 100 and 200 for GloVe and Word2Vec, respectively. In addition, for the Poisson coding and LIF with noise models, we run each model 10 times with different random seeds, and Table 1 reports the mean and standard deviation.
Table 1
Results of word embeddings on the word similarity tasks
| Method | GloVe: WordSim | GloVe: SimLex | GloVe: Rare Words | Word2Vec: WordSim | Word2Vec: SimLex | Word2Vec: Rare Words | Average |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Original | 56.45 | 39.20 | 33.57 | 61.29 | 45.73 | 44.27 | 46.75 |
| OvercompleteB | 53.23 | 41.21 | 38.90 | 41.94 | 42.71 | 33.97 | 41.99 |
| Rude binarization | 58.06 | 47.24 | 34.35 | 54.84 | 44.22 | 34.73 | 45.57 |
| Poisson | 30.16 ± 6.23 | 27.79 ± 6.27 | 15.94 ± 4.02 | 29.03 ± 9.20 | 27.64 ± 7.62 | 16.34 ± 4.05 | 24.48 |
| LIF | 66.13 | 42.21 | 31.69 | 62.90 | 55.28 | 59.16 | 52.90 |
| LIF-noise | 51.61 ± 6.31 | 35.80 ± 5.50 | 20.87 ± 4.58 | 50.81 ± 2.80 | 49.50 ± 3.78 | 16.79 ± 0.90 | 37.56 |
| Izh_CH+FS | 62.90 | 40.70 | 32.45 | 51.61 | 41.71 | 39.31 | 44.78 |
| Izh_CH+LTS | 62.90 | 42.71 | 33.59 | 45.16 | 39.20 | 37.40 | 43.49 |
| Izh_IB+FS | 66.13 | 40.20 | 31.69 | 51.61 | 40.20 | 32.44 | 43.71 |
| Izh_IB+LTS | 62.90 | 38.19 | 33.21 | 54.84 | 43.22 | 30.15 | 43.75 |
| Izh_RS+FS | 66.13 | 44.22 | 30.93 | 48.39 | 51.76 | 32.82 | 45.71 |
| Izh_RS+LTS | 62.90 | 41.71 | 30.55 | 48.39 | 44.72 | 38.93 | 44.53 |
Result Analysis
From the data shown in Tables 1 and 2 and Fig. 3, we can infer the following:
(1) We explore how to generate binary embeddings via biological neuron coding-inspired models (Figs. 2 and 3). The results show that the SNN-based models perform well, while the Poisson coding-based model reflects the weakness of rate coding when transforming dense information into binary bits: it cannot carry enough information to represent stimuli or patterns.
(2) For the word similarity task, binary word embeddings transformed from dense word embeddings, especially the rude binarization, LIF-based, and Izhikevich-based models, achieve results similar to the original embeddings.
(3) The LIF-based binary embedding model performs well on the word similarity tasks but relatively poorly on the text classification task. This may be due to the oversimplified mechanism of the LIF model, which makes it robust for representing words but loses much semantic information. The LIF model with noise improves text classification performance, but it is unstable and lowers the word similarity results.
(4) The Izhikevich neuron-based binary embedding model performs well on both tasks; the combination of the RS and FS neuron submodels is the best, with the highest average in Table 2 (57.97). The model combines excitatory and inhibitory neurons to mimic the neurons in the biological brain, which makes a difference when converting the original dense embeddings into binary ones.
(5) From the perspective of space occupation, a database of 3 million words with 300 dimensions (such as the public pretrained Word2Vec vectors) takes 3.6 GB in floating point but 1.125 GB as 3000-bit codes (\(tol = 10\)) for the Izh_RS+FS model, a reduction of approximately 68.75%. For neuron coding-based binary embedding models, the compression ratio is mainly determined by the run time \(T\) and the tolerance factor \(tol\).
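The storage figures in point (5) can be reproduced with a short calculation. The sketch assumes 32-bit floats for the dense vectors and 1 GB = 10^9 bytes, which is consistent with the 3.6 GB figure; the function name `storage_gb` is hypothetical.

```python
def storage_gb(n_words, bits_per_word):
    """Storage in GB (10^9 bytes) for n_words codes of bits_per_word bits each."""
    return n_words * bits_per_word / 8 / 1e9


n_words, dims = 3_000_000, 300
T, dt, tol = 10.0, 0.1, 10

# Code length per word: D * (T / dt) time bins, merged in windows of tol bins.
code_bits = int(round(dims * (T / dt) / tol))

dense = storage_gb(n_words, dims * 32)  # assumption: 32-bit floats
binary = storage_gb(n_words, code_bits)
print(code_bits, dense, binary, 1 - binary / dense)
```

The code length grows linearly with \(T/{\varDelta} t\) and shrinks linearly with \(tol\), so a longer run time or a smaller tolerance factor reduces the compression gain.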
Table 2
Summarized results of two tasks
| Methods | Word similarity | Text classification | Average |
| --- | --- | --- | --- |
| OvercompleteB | 41.99 | 57.19 | 49.59 |
| Rude binarization | 45.57 | 36.12 | 40.88 |
| Poisson | 24.48 | 46.72 | 35.60 |
| LIF | 52.90 | 46.79 | 49.84 |
| LIF-noise | 37.56 | 53.51 | 45.53 |
| Izh_CH+FS | 44.78 | 56.86 | 50.82 |
| Izh_CH+LTS | 43.49 | 57.87 | 50.68 |
| Izh_IB+FS | 43.71 | 62.69 | 53.20 |
| Izh_IB+LTS | 43.75 | 63.83 | 53.79 |
| Izh_RS+FS | 45.71 | 70.23 | 57.97 |
| Izh_RS+LTS | 44.53 | 69.68 | 57.11 |
Conclusion
In this paper, we propose three kinds of biological neuron coding-inspired models to generate binary word embeddings, which show better performance and interpretability than existing works on word similarity evaluation and text classification tasks. To the best of our knowledge, this is the first attempt to convert dense embeddings into binary ones via spike timing, and we have demonstrated its feasibility on some natural language processing applications.
Future Work
Due to the limited performance of supervised SNNs, in this paper we perform postprocessing operations on given word embeddings. However, we look forward to building an SNN-based language model that obtains brain-inspired word embeddings from the raw corpus. To this end, we are trying to adjust the cost function of supervised SNNs and to add several biological mechanisms, such as STDP, to the model. Furthermore, in contrast to excitatory neocortical neurons, which have stereotypical morphological and electrophysiological classes, inhibitory neocortical interneurons have widely diverse classes with various firing patterns that cannot be classified as FS or LTS [45]. In this paper, we focus on FS and LTS inhibitory neurons because their parameters in Izhikevich's neuron model are easy to obtain. In the future, we will pay more attention to more detailed types of inhibitory neuron models.
Compliance with Ethical Standards
Ethical Approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Conflict of Interest
The authors declare that they have no conflict of interest.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.