Word embedding is a technique for representing words as vectors in a way that captures their semantic and syntactic relationships. The processing time of Word2vec, one of the most popular word embedding techniques, is very large due to the huge amount of data involved. We evaluate the performance of a power-efficient FPGA-based accelerator designed using OpenCL. We achieved up to an 18.7 times speed-up compared to a single-core CPU implementation with the same accuracy. The proposed accelerator consumes less than 83 W of power and is the most power-efficient among the many top-end CPU- and GPU-based accelerators compared.
1 Introduction
Word embedding is a technique for representing words as vectors in a way that captures their semantic and syntactic relationships. The idea is to map each word onto a vector such that semantically and syntactically similar words are located near each other in the vector space. These learned vectors can be used as inputs to a variety of natural language processing (NLP) and machine learning applications with excellent results. Word2vec [1, 2] is one of the most popular word embedding methods that transforms words into vectors. It has received considerable attention in recent years and has already been utilized in various applications such as NLP [3], sentiment analysis [4, 5], and big data processing [6].
The processing time of Word2vec can be very large due to the huge amount of data involved. To reduce this processing time, many accelerators have been proposed using high-end CPUs [7–10] and GPUs [11–13]. However, these high-end processors often consume a large amount of power, which is a critical concern for applications used in everyday life. Therefore, reducing power consumption is becoming increasingly important. FPGAs have the potential to provide reasonable performance while reducing power consumption. As their performance improves rapidly, FPGAs are increasingly being used in supercomputers [14] and high-performance computing [15].
FPGAs have been used previously to accelerate Word2vec in a number of studies, including [16, 17]. The former focuses on increasing spatial parallelism, while the latter focuses on increasing temporal parallelism. Goodman et al. [18] propose the “Packet2Vec” accelerator, which builds on Word2vec for network data analysis. In this work, we extend the previous studies [16, 17] by increasing both spatial and temporal parallelism together. We also conduct a detailed analysis of power consumption, processing speed, and accuracy to discuss the trade-offs among them. Based on the experimental results, we achieved an 18.7 times speed-up compared to the CPU implementation and far better power-efficiency compared to GPU-based accelerators.
2 Previous work
2.1 Word2vec algorithm
Word2vec is a technique that generates a vector space for a large corpus of text. Each unique word in the corpus is assigned a corresponding vector, and words that share common context are located close to each other in this space. There are two ways to implement Word2vec: the first is to use the context to predict a target word, known as the Continuous Bag of Words (CBOW) model [1, 19]. The second is to use a word to predict a target context, called the Skip-gram model [1, 20].
Let us consider a corpus that contains the words \(w_1, w_2,..., w_T\). Given a word \(w_t\) at position t, the context of \(w_t\) is defined as the set of b words before and after \(w_t\). This context is called a ‘window’ in this paper. When the word \(w_t\) appears in the corpus, the probability of finding a nearby word \(w_{t+j}\) is given by \(P(w_{t+j}|w_t)\). We aim to maximize the objective function L given by Eq. (1) using a two-layer neural network, as shown in Fig. 1.
The corpus consists of a vocabulary of m unique words. The weight vectors of the hidden and the output layers are \(V=[v_1v_2v_3...v_m]\) and \(V^{\prime }=[v^{\prime }_1v^{\prime }_2v^{\prime }_3...v^{\prime }_m]\), respectively. Each vector is d-dimensional. The probability of finding the word \(w_o\) near the input word \(w_i\) is modeled by Eq. (2).
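As a reference for Eqs. (1) and (2), the standard skip-gram formulation consistent with the definitions above (following [1, 2]) can be written as follows; the exact notation here is the commonly used one and may differ slightly from the original equations:

\(L = \frac{1}{T}\sum _{t=1}^{T}\; \sum _{-b \le j \le b,\, j \ne 0} \log P(w_{t+j}\,|\,w_t)\)  (1)

\(P(w_o\,|\,w_i) = \frac{\exp (v_{w_i} \cdot v^{\prime }_{w_o})}{\sum _{j=1}^{m}\exp (v_{w_i} \cdot v^{\prime }_{w_j})}\)  (2)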
The vector corresponding to the word w is given by \(v_{\text w}\). Since the computation of \(\Sigma ^m_{j=1}\mathrm{{exp}}(v_{{\text w}_i} \cdot v^\prime _{{\text w}_j})\) is very time-consuming, a method called negative sampling is used. In this method, both the correct output (positive sample) and a small set of randomly selected incorrect outputs (negative samples) are used for a given input word. The skip-gram model with negative sampling (SGNS) is proposed in [1, 2], and a detailed explanation of the SGNS model is available in [21]. Since this paper focuses on the hardware implementation of Word2vec, we limit our explanation to the data-flow of the algorithm.
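Concretely, in the standard negative-sampling formulation the full softmax is replaced by a much cheaper per-sample objective: for an input word \(w_i\), a positive output word \(w_o\), and n negative words \(w_k\) drawn from a noise distribution, the term to be maximized is

\(\log \sigma (v_{w_i} \cdot v^{\prime }_{w_o}) + \sum _{k=1}^{n} \log \sigma (-v_{w_i} \cdot v^{\prime }_{w_k})\), where \(\sigma (x) = 1/(1+e^{-x})\).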
Fig. 1
Neural network of Word2vec
Algorithm 1
An extract of the SGNS algorithm
2.2 Acceleration of Word2vec
Algorithm 1 shows the sequential computation of the SGNS model. This computation is highly data dependent and difficult to parallelize. The most popular method of acceleration is to use the Hogwild algorithm [22], which computes the outer loop in parallel in multiple threads while ignoring the data dependency among them. The work in [23] also applies this method by deploying multiple processing units to process different portions of the outer loop simultaneously. However, external memory access is a critical problem in this method. From Algorithm 1, we can see that the external memory access grows roughly in proportion to the number of word–target pairs processed, since d elements of the vectors v and \(v'\) are accessed for each pair.
As we can see in Algorithm 1, the outputword in line 2, the inputword in line 5, the negative words in line 14, and the weight vectors v and \(v'\) are accessed from the external memory. Although the vector v can be served from the cache, since the value of \(l_i\) does not change rapidly, the value of \(l_o\) changes in each iteration of the loop in line 8 due to the different negative words. Since the negative words are selected randomly, the value of \(l_o\) is also decided randomly, so accessing \(v'\) requires a large memory bandwidth. To solve this problem, [7] proposes a method to share negative samples, which uses the same set of n negative words inside a window. This method has been used in many other studies to increase the processing speed, and it has been shown that the method in [7] does not have a noticeable impact on the accuracy of the algorithm. The gradient multiplied by the learning rate ALPHA is given by g in line 22.
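As a rough illustration of this inner computation, the following C sketch follows the structure of the widely used word2vec reference code with shared negative samples; all identifiers are illustrative and the listing does not reproduce the line numbering of Algorithm 1.

```c
#include <math.h>

#define MAX_D 1024  /* assumed upper bound on the vector dimension */

/* Hedged sketch of the SGNS inner computation with shared negative samples.
 * v: input (hidden-layer) vectors, v_prime: output-layer vectors, both m x d,
 * stored row-major. The same n negative words are reused for every pair inside
 * one window, so the corresponding rows of v_prime stay in fast memory. */
void train_pair(float *v, float *v_prime, int d,
                long li,             /* base index of the input word          */
                const long *targets, /* output word followed by n negatives   */
                const int *labels,   /* 1 for the positive pair, 0 otherwise  */
                int num_targets, float alpha)
{
    float neu1e[MAX_D] = {0};        /* accumulated gradient for v[li]        */

    for (int s = 0; s < num_targets; s++) {
        long lo = targets[s];
        float f = 0.0f;
        for (int k = 0; k < d; k++)  /* dot product v[li] . v_prime[lo]       */
            f += v[li * d + k] * v_prime[lo * d + k];

        /* gradient multiplied by the learning rate (g in the text) */
        float g = ((float)labels[s] - 1.0f / (1.0f + expf(-f))) * alpha;

        for (int k = 0; k < d; k++) {
            neu1e[k]            += g * v_prime[lo * d + k];
            v_prime[lo * d + k] += g * v[li * d + k];   /* update output vector */
        }
    }
    for (int k = 0; k < d; k++)      /* write back the input-vector update    */
        v[li * d + k] += neu1e[k];
}
```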
Even with negative sample sharing, the processing speed does not improve after adding a certain number of processing units. Ji et al. [17] propose a systolic array architecture that exploits temporal parallelism by reusing intermediate data. As a result, the required memory bandwidth is reduced significantly while still allowing parallel data access. In this paper, we further improve the method in [17] by combining the Hogwild algorithm with the systolic array architecture. We conducted a comprehensive evaluation to measure the processing speed, power consumption, and accuracy. Based on this evaluation, we discuss how to improve the power-efficiency and accuracy.
3 FPGA-based systolic array architecture
3.1 Data-flow of Word2vec
In this section, we analyze the data-flow of the Word2vec algorithm to understand its data dependency. As shown in Algorithm 1, data dependency exists among the arrays v and \(v^\prime \). The data of the array v are read in line 20 and written back in line 31. Similarly, the data of the array \(v^\prime \) are read in lines 20 and 24, and written back in line 27. All the other data are read-only, so no data dependency exists for them. Each vector has d dimensions. Therefore, d elements are accessed from the arrays v and \(v^\prime \), starting at the base addresses \(l_i\) and \(l_o\), respectively. The values of \(l_i\) and \(l_o\) are decided by inputword and targetword, where targetword is either an outputword or a negativeword.
Algorithm 2 shows a simplified version of the SGNS method that focuses only on the updates of the v and \(v^\prime \) arrays. For simplicity, let us denote \(v[l_i]\) as \(v_i\) and \(v^\prime [l_o]\) as \(v^\prime _o\), where \(i, o \in \{1, 2, 3,... \}\). We omit the negative words, which are selected randomly and introduce only a small data dependency. The inner loop of Algorithm 2 runs over all words in a window of a fixed size called the “window size”. In each iteration of the inner loop, the computation is done for one \([v_i, v^\prime _o]\) pair. Figure 2 shows the data access in each window using a concrete example. The same output vector is used in the computations of all loop iterations inside a window. Therefore, we cannot process those computations in parallel. However, there is a possibility to process some of the computations belonging to different windows in parallel.
Algorithm 2
Major data dependencies among vectors corresponding to input and output words.
Fig. 2
Data access
Figure 3 shows a scheduling method that allows parallel processing. Different input and output vectors are scheduled to the same step. The same input vectors v are scheduled again after two steps, while the same output vectors \(v^\prime \) are scheduled again after one step. The intermediate data of the input and output vectors can be stored in registers and accessed in the following steps. Therefore, we can reduce the external memory access dramatically while applying parallel computations. Since all intermediate data are shared, the reduction of the memory access is roughly proportional to “\(\textrm{negative}\_\textrm{samples} + 1\)”; for example, with three negative samples per positive sample, the external memory traffic is reduced roughly fourfold.
In this scheduling method, we assume that the words at different positions are not the same. Under this assumption, changing \(v^\prime _{11}\) does not affect \(v^\prime _{12}\). Although this holds in most cases, the same word can occasionally appear at different positions. In such cases, those words are processed independently in multiple PEs, neglecting the updates made by the other PEs. This causes data dependency violations, which can reduce the accuracy. We discuss this accuracy reduction in the evaluation.
Fig. 3
Scheduling scheme for parallel computation
3.2 Systolic array architecture
Figure 4 shows the proposed architecture that realizes the scheduling method shown in Fig. 3. As discussed in Sect. 3.1, the degree of temporal parallelism equals the window size. Therefore, the number of PEs is 2b, where there are b words to the left and b words to the right of the target word. At a given step, the vector pair \((v_{14}, v^{\prime }_{11})\) is transferred to \(\mathrm{PE}_b\). The computation on the vectors \(v_{14}\) and \(v^{\prime }_{11}\) is performed and the results are stored in shift registers. In the next step, the vector pair \((v_{15}, v^{\prime }_{12})\) is transferred to \(\mathrm{PE}_b\). At the same time, \(v_{14}\) and \(v^{\prime }_{11}\) are accessed from the shift registers. Note that \(v^{\prime }_{11}\) was computed in the previous step, while \(v_{14}\) was computed two steps before. In each step, a new pair of vector data is read and the previously computed data are transferred to the next PEs. After the computation in the final PE (\(\mathrm{PE}_{-b}\)), the data are transferred to the global memory. The OpenCL environment provides access from the kernel to the external memory on the FPGA board through the memory controller. A kernel is the circuit configured inside the FPGA that performs the computation. All the code for the memory controller and its connections to the kernel is generated automatically by the OpenCL compiler.
Fig. 4
Systolic array architecture
Algorithm 3
A pseudo-code that briefly describes the OpenCL kernel used to define systolic array.
Fig. 5
Expected structure of a PE
Figure 5 shows the structure of a PE, which is automatically constructed by OpenCL. A PE mainly performs the vector computations in parallel. The input data of a PE are accessed either from another PE or from the external memory. Similarly, the output data are transferred either to the next PE or to the external memory. The data transfer between PEs is done through shift registers.
Algorithm 3 shows a pseudo-code that briefly describes the OpenCL kernel used to define the systolic array. The whole systolic array is defined as a single kernel. The shift registers are defined as an array in line 2 and initialized in parallel in line 8. In each clock cycle, the data of the shift registers are shifted entirely, as shown in lines 13 to 20. The processing of a window position b (or PE b) is shown in lines 23 to 37. The data from the shift registers are read and the vector computations are done in parallel. Then the processed data are copied back to the shift registers. Since all “if” statements in the loop at line 11 are processed in parallel, all PEs also execute in parallel.
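As a rough, hedged sketch of this kernel structure, the single work-item OpenCL kernel below shows how a fully unrolled loop over window positions can be mapped to 2b parallel PEs, with shift-register arrays carrying intermediate vectors between them. All identifiers, constants, the address schedule, and the simplified positive-only update are illustrative assumptions; they do not reproduce the authors' Algorithm 3 or its line numbering.

```c
/* Illustrative single work-item OpenCL kernel sketching the systolic array
 * described in the text. Constants, identifiers, and the address schedule are
 * assumptions for illustration; the reuse distances of input and output
 * vectors are simplified compared to the scheduling of Sect. 3.1. */
#define B      6              /* half window size -> 2*B PEs            */
#define D      160            /* vector dimension                        */
#define DEPTH  (2 * B)        /* shift-register depth: one slot per PE   */

__kernel void sgns_systolic(__global float *restrict v,
                            __global float *restrict v_prime,
                            __global const int *restrict li_sched,
                            __global const int *restrict lo_sched,
                            const int num_steps, const float alpha)
{
    float v_reg[DEPTH][D];     /* in-flight input vectors  */
    float vp_reg[DEPTH][D];    /* in-flight output vectors */

    for (int step = 0; step < num_steps; step++) {
        /* Shift the whole register array by one slot each cycle */
        #pragma unroll
        for (int s = DEPTH - 1; s > 0; s--) {
            #pragma unroll
            for (int k = 0; k < D; k++) {
                v_reg[s][k]  = v_reg[s - 1][k];
                vp_reg[s][k] = vp_reg[s - 1][k];
            }
        }

        /* The first PE reads a fresh vector pair from external memory */
        #pragma unroll
        for (int k = 0; k < D; k++) {
            v_reg[0][k]  = v[li_sched[step] * D + k];
            vp_reg[0][k] = v_prime[lo_sched[step] * D + k];
        }

        /* Fully unrolled: the compiler instantiates one PE per window position */
        #pragma unroll
        for (int pe = 0; pe < DEPTH; pe++) {
            float f = 0.0f;
            #pragma unroll
            for (int k = 0; k < D; k++)          /* P-way unrollable dot product */
                f += v_reg[pe][k] * vp_reg[pe][k];
            float g = (1.0f - 1.0f / (1.0f + exp(-f))) * alpha;  /* positive sample only */
            #pragma unroll
            for (int k = 0; k < D; k++) {
                float vi = v_reg[pe][k];
                v_reg[pe][k]  += g * vp_reg[pe][k];
                vp_reg[pe][k] += g * vi;
            }
        }

        /* The last PE writes its finished pair back to external memory */
        if (step >= DEPTH - 1) {
            #pragma unroll
            for (int k = 0; k < D; k++) {
                v[li_sched[step - DEPTH + 1] * D + k]       = v_reg[DEPTH - 1][k];
                v_prime[lo_sched[step - DEPTH + 1] * D + k] = vp_reg[DEPTH - 1][k];
            }
        }
    }
}
```

In the actual design, the shift-register indexing would also implement the two-step reuse of input vectors and the one-step reuse of output vectors described in Sect. 3.1, and each PE would additionally process the shared negative samples.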
Each vector is d-dimensional, so there are d element-wise computations per vector pair, as shown in Algorithm 1. We compute P of these computations in parallel in each PE. Therefore, each PE has a degree of spatial parallelism of P, and all PEs collectively provide a degree of temporal parallelism of 2b. In addition, it is possible to execute more than one such systolic array in parallel until the data-transfer bottleneck is reached. As shown in Fig. 6, we execute multiple systolic arrays that all share the same external memory. This corresponds to the “Hogwild” algorithm discussed in [22]. However, it increases the required memory bandwidth significantly. Therefore, we evaluated the proposed method on an HBM (High Bandwidth Memory)-based FPGA, which has a larger bandwidth compared to conventional external memory-based FPGAs.
Fig. 6
Hogwild
4 Evaluation
For the evaluation, we utilized two FPGA boards, namely “Stratix10 GX 520N” and “Stratix10 MX 520N-MX”. The former is equipped with four DDR4 memory modules connected to the FPGA as external memory, offering a maximum bandwidth of 76.8 GB/s. The latter incorporates HBM2 memory within the FPGA chip, providing a theoretical bandwidth of 512 GB/s. The CPU employed in our setup is an Intel Xeon Gold 6226R connected to 192 GB of DDR4 memory. To compile the FPGA kernels, we employed Intel FPGA SDK for OpenCL 19.4, while the host code was compiled using gcc 10.2.0. Our implementation of word2vec is based on the code published in [24]. For training, we utilized the “news.shuffled.en.2010” sample (2.1 GB) from the “one billion word benchmark” [25]. The precision of the FPGA accelerator is 32-bit floating point.
Fig. 7
Processing speed comparison against the degree of spatial parallelism
4.1 Evaluation of the proposed architecture
In this section, we evaluate the proposed architecture under different parameters and conditions. Figure 7 shows the processing speed against the spatial parallelism, which is varied by changing the number of kernels. Figure 7a shows the processing speed given as kilo words per second. According to the results, the processing speed increases up to 2 kernels on the Stratix10 GX 520N FPGA board. However, the processing speed decreases for the 3-kernel implementation, which can be attributed to a significant reduction in the clock frequency and to insufficient memory bandwidth. According to the OpenCL profiler, there is a large memory-stall percentage due to the lack of memory bandwidth. When utilizing the Stratix10 MX 520N-MX FPGA board, the processing speed increases with the number of kernels despite the decrease in clock frequency. To explain this clearly, we use Fig. 7b, which shows the processing speed given as kilo words per clock cycle. Since the effect of the clock frequency is eliminated, we can see a continuous increase in the processing speed with the number of kernels. Note that the increase in the processing speed is larger for the Stratix10 MX. This is due to the larger memory bandwidth of the HBM2-based Stratix10 MX. Note that the 3-kernel implementation utilizes over 80% of the resources of the Stratix10 MX, so we cannot increase the number of kernels further.
Fig. 8
Comparisons against multicore CPU implementation for speed and accuracy
The comparison against the 16-core Intel Xeon Gold 6226R CPU implementation in terms of processing speed and accuracy is shown in Fig. 8. The CPU code is compiled using gcc 10.2.0 with the optimization options “-O3 -march=native -pthread”. The precision is 32-bit floating point, the same as that of the FPGA accelerator. As shown in Fig. 8a, the CPU implementation gets faster as more cores are used. However, the 3-kernel Stratix10 MX implementation is over two times faster than the 16-core CPU implementation. On the other hand, the CPU implementations have better accuracy than all FPGA implementations, as shown in Fig. 8b. We measured the accuracy using the “Google analogy data set”. The accuracy of the CPU implementations is nearly the same for all core counts. Similarly, the accuracy of the FPGA implementations is consistent across different numbers of kernels.
Fig. 9
Power consumption of the FPGA implementation. Window size of the FPGA accelerator is 6 and the number of negative samples is 3
Fig. 10
Comparisons against multicore CPU implementation for power and energy
Figure 9 shows the power consumption of the 2-kernel FPGA implementation. The power consumption of the FPGA board is measured by reading the power-sensor data in real time while the accelerator is operating. This value includes the power consumed by both the external memory access and the computation. The power-sensor data are read in real time every 500 ms using a separate CPU thread. The power consumption is almost constant, with a small fluctuation between 80 and 83 W. We use the maximum power consumption value for our power analysis.
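As a minimal sketch of this measurement setup, the logging thread could look like the following C code; read_board_power_watts() is a placeholder for the vendor-specific sensor query, not a real API.

```c
/* Minimal sketch of the power-logging thread described above: a separate CPU
 * thread samples the board power sensor every 500 ms while the kernel runs
 * and keeps the maximum reading. */
#include <pthread.h>
#include <unistd.h>

extern float read_board_power_watts(void);   /* placeholder sensor query */

static volatile int keep_sampling = 1;
static float max_power_w = 0.0f;

static void *power_logger(void *arg)
{
    (void)arg;
    while (keep_sampling) {
        float p = read_board_power_watts();
        if (p > max_power_w)
            max_power_w = p;                  /* keep the maximum reading  */
        usleep(500 * 1000);                   /* 500 ms sampling interval  */
    }
    return NULL;
}

/* Host side (illustrative):
 *   pthread_t t; pthread_create(&t, NULL, power_logger, NULL);
 *   ... launch and wait for the FPGA kernel ...
 *   keep_sampling = 0; pthread_join(t, NULL);                        */
```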
The comparison against the multicore CPU implementation in terms of power and energy is shown in Fig. 10. The power of the whole server with the CPU is measured continuously in real time using two methods. One method is to read the power sensors using the “lm_sensors” software. The other is to use a power tester, the “Hioki 3334 AC/DC Power HiTester”. Both methods provide similar results, with less than 1.2% difference. Note that we removed the FPGAs before measuring the power. According to the results, the server consumes 81 W of power in the idle state and 212 W when running on 16 cores. Therefore, we calculated the operating power by deducting the idle-state server power from the power in the operational state. According to the results, the CPU and external memory consume 131 W of operating power. This value does not contain the static component, so we consider it to be slightly underestimated.
As shown in Fig. 10a, when using more than 8 cores, the CPU power consumption exceeds that of the FPGA. The power-efficiency, which is measured by dividing the processing speed by the power consumption, is shown in Fig. 10b. The power-efficiency of the CPU increases when using more cores in parallel. However, the power-efficiency of the FPGA is nearly three times higher than that of the most power-efficient CPU implementation, which uses all 16 cores. Figure 10c shows the energy consumption, which is calculated by multiplying the power by the processing time. We can see that the FPGA implementations consume the least amount of energy.
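Written out, the two derived metrics used in Fig. 10 are simply

\(\textrm{power-efficiency} = \frac{\textrm{processing speed (kilo words/s)}}{\textrm{power (W)}}, \qquad \textrm{energy (J)} = \textrm{power (W)} \times \textrm{processing time (s)}.\)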
The remainder of this section evaluates the proposed temporal parallelism. Therefore, we use the 1-kernel implementation to eliminate the effects of the conventional spatial parallelism obtained by using multiple kernels. The input sample contains 8 GB of data. Figure 11 shows the comparison of the proposed FPGA implementation against the CPU for different numbers of negative samples. As shown in Fig. 11a, the processing times of both the FPGA and CPU implementations increase similarly. A larger number of negative samples increases the amount of computation, which leads to longer processing times. Figure 11b shows the comparison of accuracy for different numbers of negative samples. In the CPU implementation, the accuracy is similar for different numbers of negative samples. In the FPGA implementation, the accuracy increases initially but then settles to a nearly constant value. A possible reason for this could be the data dependency violations in the FPGA implementation. Figure 11c shows the resource usage of the FPGA implementation. Since the degree of parallelism does not change, the resource usage remains the same.
Fig. 11
Comparison of FPGA and CPU implementations when changing the negative sample size. Horizontal axis represents the number of negative samples. Processing time is shown in log scale
Figure 12 shows the comparison of the proposed FPGA implementation against the CPU implementation for different vector sizes. The processing times of both the FPGA and CPU implementations increase similarly, as shown in Fig. 12a. This is due to the increase in the amount of computation for larger vector sizes. Increasing the vector size usually allows a word to be represented in more detail. Therefore, the accuracy of both the FPGA and CPU implementations increases, as shown in Fig. 12b. Since the initial accuracy of the FPGA implementation is low, we can see a greater improvement in accuracy with the vector size. The resource usage remains the same, as shown in Fig. 12c, since the degree of parallelism is the same.
Fig. 12
Comparison of FPGA and CPU implementations when changing the vector dimension. Horizontal axis represents the number of dimensions. Processing time is shown in log scale
Figure 13 shows the comparison of the proposed FPGA implementation against the CPU for different window sizes. The processing times of the CPU implementations increase, while those of the FPGA implementations remain unchanged, as shown in Fig. 13a. The amount of computation increases with the window size, which increases the processing time of the CPU. However, multiple window positions are processed in parallel in the FPGA using multiple PEs. Due to this parallel processing, the processing time of the FPGA implementation is unchanged. As shown in Fig. 13b, there is a very small improvement in the accuracy of the CPU implementation with the window size. On the other hand, there is a significant reduction in the accuracy of the FPGA implementation. Increasing the window size also increases the data dependency violations in the FPGA implementation, which reduces the accuracy. Since the number of PEs increases with the window size, the resource usage also increases, as shown in Fig. 13c. Since the resource increase is linear, 26 is the maximum possible window size, which requires over 80% of the FPGA logic resources.
Fig. 13
Comparison of FPGA and CPU implementations when changing the window size. Horizontal axis represents the window size. Processing time is shown in log scale
Figure 14 shows the comparison of accuracy for different data sample sizes. In this evaluation, the window size is 6, the vector size is 160, and the number of negative samples is 3. The accuracy of the single-thread CPU implementation is always higher than that of the FPGA implementation. However, the accuracy improves with the data size similarly for both implementations. This shows that even with a large data size, the single-thread CPU implementation always has better accuracy.
Fig. 14
Accuracy against the data size
According to the above results, the FPGA implementation is faster but the CPU implementation is more accurate. To obtain a fairer comparison, we measure the processing time of both the CPU and FPGA implementations when the accuracy is the same. We use the same data sample but change the vector size to alter the accuracy. As shown in Fig. 15, the FPGA implementation is 11.8–18.7 times faster than the CPU implementation. Therefore, the proposed FPGA implementation has a clear advantage in processing speed compared to the CPU implementation.
Fig. 15
Processing speed comparison against the same accuracy
4.2 Comparison with previous accelerators
Figure 16 shows the processing time comparison against previous CPU, GPU, and FPGA implementations. In this evaluation, the number of negative samples is 3, the window size is 6, and the processing speed is measured as the number of kilo words processed per second. We can see that the proposed FPGA implementation is faster than all previous works except [12], which uses a GPU.
Fig. 16
Processing speed of the FPGA implementation. Window size is 6 and number of negative samples is 3
Figure 17 shows the power-efficiency, which is measured by dividing the processing speed by the power consumption. For the Stratix10 GX FPGA board, the power is measured in real time by reading the on-board sensor data every 500 ms, and the largest value is selected as the power consumption. For the GPUs and CPUs, detailed power-consumption figures are not available in the previous studies. Therefore, we use the maximum TDP value as a reference. Although the GPU implementations in previous works are heavily optimized, this power value could be an overestimate for the GPUs. However, it can be an underestimate for the CPUs, since it does not include the power consumption of the external memory. According to the results, the power-efficiency of the proposed method is many times higher than that of the other methods. Therefore, we can clearly say that the FPGA implementation is the most power-efficient one.
Fig. 17
Power-efficiency of the FPGA implementation. Window size is 6 and number of negative samples is 3
5 Conclusion
In this paper, we combined different temporal and spatial processing techniques for Word2vec into a single accelerator architecture and implemented it on FPGAs. We evaluated the performance of the proposed architecture and compared it against existing CPU and GPU-based accelerators. According to the results, the proposed FPGA accelerator is up to 18.7 times faster while maintaining the same accuracy as the CPU implementation. We discussed several methods to increase the accuracy, such as increasing the vector size and the sample size. We also discussed the disadvantages of the proposed method, such as the accuracy reduction and the area increase for larger window sizes. The proposed accelerator is extremely power-efficient compared to previously proposed accelerators, while also providing a considerably higher processing speed. Since Word2vec is a key processing step in other embedding methods such as Doc2vec [26, 27], Subgraph2vec [28], and Graph2vec [29], the proposed accelerator could be useful in future applications. Note that the code of the proposed accelerator can be ported to the oneAPI environment for use with future FPGAs.
Declarations
Conflict of interest
The authors declare no conflict of interest.
Ethical approval
Not applicable
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.