Published in: International Journal of Parallel Programming 1-2/2024

Open Access 13.02.2024

Access Interval Prediction by Partial Matching for Tightly Coupled Memory Systems

Authors: Viktor Razilov, Robert Wittig, Emil Matúš, Gerhard Fettweis



Abstract

In embedded systems, tightly coupled memories (TCMs) are usually shared between multiple masters for the purpose of hardware efficiency and software flexibility. On the one hand, memory sharing improves area utilization, but on the other hand, it can lead to performance degradation due to an increase in access conflicts. To mitigate the associated performance penalty, access interval prediction (AIP) has been proposed. In a similar fashion to branch prediction, AIP exploits program flow regularity to predict the cycle of the next memory access. We show that this structural similarity allows for the adaptation of state-of-the-art branch predictors, such as Prediction by Partial Matching (PPM) and the TAgged GEometric history length (TAGE) branch predictor. Our analysis on memory access traces reveals that PPM predicts 99 percent of memory accesses. As PPM does not lend itself to hardware implementation, we also present the PPM-based TAGE access interval predictor, which attains an accuracy of over 97 percent, outperforming all previously presented implementable AIP schemes.
Notes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

1 Introduction

Since the end of Dennard scaling, computer architects have shifted to exploiting thread-level parallelism by integrating multiple processing elements (PE) as a system on chip. In area- and energy-constrained embedded systems, PEs may communicate via a shared scratchpad tightly coupled memory (TCM) to increase area utilization. A drawback of this approach is the introduction of access conflicts. Since the access latency of TCMs is usually limited to a single clock cycle, on-line conflict resolution and arbitration logic may degrade the clock frequency, and thus the system performance. The degradation becomes more severe when different run-time priorities have to be considered, as is the case in mobile communication platforms [1], for example.
Access interval prediction (AIP) has been proposed to resolve TCM access conflicts in advance [2, 3]. Memory accesses of PEs are predicted in the time domain so that an off-line arbiter speculatively allocates memory access time slots to one of them and configures the access logic accordingly. Logic on the path between the PEs and the TCM is kept to a minimum while priorities can be accounted for.
In an AIP system, mispredictions lead to misallocations. Therefore, a high prediction accuracy is crucial. Existing AIP schemes fall short: the access look-ahead buffer (ALB) [2] achieves average hit rates of only around 50 % (cf. Sect. 6), while the maximum a-posteriori (MAP) estimator [3] was not designed with hardware implementation in mind.
To find better-performing predictors, we turn to branch prediction for inspiration because it is a well-researched problem that is similar to AIP. In branch prediction, the outcomes of conditional branches are predicted and instructions are executed speculatively based on the prediction. A plethora of high-precision branch predictors have been proposed [4]. Of these, the TAGE branch predictor (TAGE-BP) [5], which approximates prediction by partial matching (PPM) [6], is widely considered to be state of the art [4]. Fundamentally, both work by recording branch outcome sequences of multiple history lengths.
In this paper, we leverage the maturity of both prediction methods for AIP. We adapt them to work on sequences of interval lengths and evaluate their performance on access traces of MiBench [7] on a RISC-V core.
The rest of the paper is structured as follows: Sect. 2 provides an overview of the background and related work on memory sharing and arbitration. Afterwards, in Sect. 3, we review AIP and introduce the system model and the notation used throughout this paper. Section 4 explains PPM and its software implementation as a suffix tree. In Sect. 5, the TAGE-BP is adapted for AIP. We analyze the main modifications and show the trade-offs involved and make a comparison to other predictors in Sect. 6. We finish with conclusions in Sect. 7.
2 Background and Related Work

Shared memory systems are designed such that memory can be used by each master as seamlessly as if it had its own. A master is any entity that uses the shared memory, e.g., a central processing unit (CPU) or a dedicated accelerator. The memory system has to grant memory access to masters upon their request. As multi-ported static random access memory (SRAM) is prohibitively area-expensive, it also has to resolve conflicts when multiple masters request access concurrently.
Architectures such as the T-CREST [8] or MERASA [9] are able to arbitrate between masters with different priorities and ensure an upper bound for the transaction latency. However, their additional delay of several cycles renders them unsuitable for TCMs.
On-chip memory arbiters for TCMs [10–14] have instead been optimized for low latency, omitting additional prioritization logic. They achieve good average performance, but the performance of high-priority tasks is degraded.
The number of conflicts may be reduced by banking SRAM cells [2, 13–15]: if simultaneous requests target different memory banks, they do not conflict. This approach works especially well when compile-time information is available.
AIP, on the other hand, is able to consider priorities while utilizing only run-time information. High-priority tasks are sped up without degrading the clock frequency. Simple prediction schemes, such as the ALB [2], predict up to 50 % of accesses on data memory. The work in [3] showed that more sophisticated statistical methods may be able to reach accuracies over 90 %. Previous work [16] presented TAGE with a prediction accuracy over 97 % at moderate hardware requirements. In this work, we extend the investigation with PPM and improve the access identification in PPM and TAGE.

3 Access Interval Prediction

3.1 System Overview

Figure 1 depicts the example model of a system employing AIP that is used throughout the paper. Two masters, a high-priority master (HPM) and a low-priority master (LPM), share a TCM. Here, the HPM is the data port of a RISC-V core, and the LPM is a data movement engine that accesses the memory every cycle. When the HPM requests the memory, it should always be granted unhindered access by the arbiter.
To achieve this without elongating the path between TCM and masters, we add the AIP unit. It monitors the data memory interface of the HPM, which is made up of the signals access and adr. access is set whenever an access (read or write) to the address adr takes place. In addition, the AIP unit intercepts the program counter pc of the core.
The AIP unit predicts the next access and sets pred whenever it determines that the HPM will request the data memory in the next cycle. When this happens, the arbiter pre-allocates the memory to the HPM, and to the LPM otherwise.
This simple model allows performance comparison of different prediction schemes. AIP also scales to more complex priority hierarchies, since predictions can be made for multiple masters while the priority logic remains off-line.

3.2 Theory and Notation

We regard the program flow as a sequence of \(n_\textrm{access}\) access blocks. Each access block \(B_i = (A_i, I_i, P_i)\) contains the address \(A_i\) accessed at cycle \(K_i\), the length of the interval until the next access \(I_i = K_{i+1} - K_i\) and the value of the program counter at access time \(P_i = \texttt{pc}(K_i)\). The program stops with the last access at \(K_\textrm{end}\).
An excerpt of an example program is shown in Fig. 2. In this case, \(j+1\) access intervals were completed and can be summarized as history \(H_j = {(B_i)}_{i=0}^{j}\). After each access on an address \(A_i\), the predictor makes a prediction of the next interval \(\hat{I}_i\) based on \(H_{i-1}\), \(A_i\) and \(P_i\). Subsequently, the indicator signal pred is set at cycle \(k = K_i + \hat{I}_i - 1\).
If the actual cycle equals the predicted one, i.e., \(\texttt{access}(k) = \texttt{pred}(k-1) = 1\), an access has been predicted correctly. We denote by \(n_\textrm{hit}\) the number of correctly predicted accesses, by \(P_\textrm{hit}= \frac{n_\textrm{hit}}{n_\textrm{access}}\) the hit rate, and by \(P_\textrm{miss}= 1 - P_\textrm{hit}\) the miss rate.
When \(\texttt{pred}(k-1) = 1 \ne \texttt{access}(k)\), a false positive prediction occurs. The memory is mistakenly reserved for the HPM and memory utilization is impaired. The interval utilization, the percentage of idle cycles where the memory is not blocked because of false predictions, is determined from the number of false positives \(n_\textrm{fp}\) by
$$\begin{aligned} U_\textrm{I}= 1 - \frac{n_\textrm{fp}}{K_\textrm{end}+ 1 - n_\textrm{access}}. \end{aligned}$$
(1)
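The definitions above translate directly to code. The following sketch computes the three metrics from raw counts; the counts used in the example call are illustrative, not taken from the paper.

```python
def aip_metrics(n_access, n_hit, n_fp, k_end):
    """Hit rate, miss rate, and interval utilization as defined above.

    n_access : total number of memory accesses
    n_hit    : number of correctly predicted accesses
    n_fp     : number of false-positive predictions
    k_end    : cycle of the last access (cycles are 0-indexed)
    """
    p_hit = n_hit / n_access
    p_miss = 1.0 - p_hit
    # Eq. (1): idle cycles are all cycles that carry no access.
    n_idle = k_end + 1 - n_access
    u_i = 1.0 - n_fp / n_idle
    return p_hit, p_miss, u_i

# Illustrative numbers (not from the paper):
p_hit, p_miss, u_i = aip_metrics(n_access=1000, n_hit=970, n_fp=20, k_end=4999)
```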

4 Prediction by Partial Matching

Prediction by partial matching (PPM) originated in text compression [6]. The idea is to track, for each subsequence with a length \(n \le N\) in the history \(H_i\), the frequency of follow-up events. PPM then identifies the longest subsequence of \(H_{i-1}\) which is a suffix of \(H_i\). The prediction is given by the most frequent successor of the matching subsequence.
A suffix tree is a suitable data structure for implementation. An example for the history
$$\begin{aligned} H_i = (a,b,c,d,a,b,d,a,b,c,d,b,c,d,b) \end{aligned}$$
(2)
is depicted in Fig. 3. Every node in this tree with a value \(\sigma \) and a prediction \(\hat{\sigma }\) (cf. Fig. 3a) is the starting point of a subsequence of \(H_{i-1}\). To find the node associated with a sequence, one traverses the tree following the history sequence elements backwards. The top left node in Fig. 3b belongs to the sequence (a, b, d), for example. If a sequence is not present in \(H_{i-1}\) or is longer than N, one finds its longest present suffix instead. The prediction attribute of the node yields the prediction for the found sequence.
For example, the current history may be as in (2) and the limit \(N = 3\). Applying the suffix tree traversal to our example \(H_i\) returns the node of the sequence \(s = (c,d,b)\) with length \(n = 3\) as the longest suffix of \(H_i\). When it occurred in \(H_{i-1}\) (once in this case), it was most often succeeded by c. Therefore, c is predicted.
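The longest-suffix lookup described above can be sketched in a few lines of Python. A brute-force scan over the history stands in for the suffix tree here; a real implementation would use the tree for efficiency.

```python
from collections import Counter

def ppm_predict(history, N):
    """Brute-force PPM: find the longest suffix of `history` (length <= N)
    that also occurred earlier in the history, and return the most frequent
    symbol that followed it. Falls back to the order-0 statistics."""
    for n in range(min(N, len(history) - 1), 0, -1):
        suffix = tuple(history[-n:])
        # Count the successors of every earlier occurrence of `suffix`.
        successors = Counter(
            history[j + n]
            for j in range(len(history) - n)
            if tuple(history[j:j + n]) == suffix
        )
        if successors:
            return successors.most_common(1)[0][0]
    # Order-0 fallback: the most frequent symbol overall.
    return Counter(history).most_common(1)[0][0]

# The example history (2): the longest matching suffix with N = 3 is
# (c, d, b), which was followed by c, so c is predicted.
H = list("abcdabdabcdbcdb")
```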
The removal of all redundant nodes, i.e., those that have a parent node with the same \(\hat{\sigma }\), renders a reduced tree as in Fig. 3c. Some nodes are only deactivated if they produce redundant predictions but have non-redundant children, such as the node representing (a, b). We define the tree size by the number of active nodes, as these are the ones for which predictions have to be stored. The tree size shrinks after reduction and deactivation while the predictions stay the same.
Table 1
Access trace of a simple example program

i    | 0   1   2   3   4   5   6   7   8   9   10  11  12  13
a_i  | a^a a^a a^b a^a a^b a^a a^b a^a a^b a^a a^b a^a a^b a^a
I_i  | 1   1   2   1   2   3   4   1   1   1   2   3   4   –

(\(I_{13}\) is not yet known at prediction time.)
We adapt PPM to fit AIP in a similar fashion as prior work adapted PPM for branch prediction [17, 18]. At each access, we construct a modified history dubbed the observation window
$$\begin{aligned} C_i(N) = (I_{i-N}, I_{i-N+1}, \ldots , I_{i-1}, a_i) \end{aligned}$$
(3)
out of the last N intervals and an identifier \(a_{i}\) for the current access. Either the access address \(A_{i}\), the program counter \(P_{i}\), or a combination of both, \((A_{i}, P_{i})\), can serve as the access identifier. The observation window is used to traverse the suffix tree and to update it once \(I_{i}\) is known. While other schemes utilizing more past access identifiers are conceivable, [17] demonstrated that using only one address does not considerably impair prediction accuracy, at least for branch prediction.
Table 1 and Fig. 4 exemplify a program and the resulting full suffix tree for \(N = 3\). The first layer contains all accesses that occurred and the child nodes contain the preceding intervals. When we want to predict \(\hat{I}_{13}\) with the suffix tree in Fig. 4, we land on the top left node, which belongs to the sequence \((3,4,a^{a})\), and obtain \(\hat{I} = 1\).
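A sketch of interval prediction with the observation window (3), again using a brute-force search in place of the suffix tree; the trace is the one from Table 1, with "a"/"b" standing in for \(a^\textrm{a}\)/\(a^\textrm{b}\).

```python
from collections import Counter

def predict_interval(accesses, intervals, N):
    """Predict the next interval for the latest access (accesses[-1]).

    accesses  : identifiers a_0 .. a_i
    intervals : completed intervals I_0 .. I_{i-1}
    Matches the longest suffix of past intervals, together with the current
    access identifier, against earlier occurrences in the trace.
    """
    i = len(accesses) - 1
    for n in range(min(N, len(intervals)), -1, -1):
        ctx = tuple(intervals[i - n:i])
        outcomes = Counter(
            intervals[j]
            for j in range(n, len(intervals))
            if accesses[j] == accesses[i] and tuple(intervals[j - n:j]) == ctx
        )
        if outcomes:
            return outcomes.most_common(1)[0][0]
    return None  # no information about this identifier yet

# Table 1: predict I_13 for the a^a access at i = 13.
a = ["a", "a", "b", "a", "b", "a", "b", "a", "b", "a", "b", "a", "b", "a"]
I = [1, 1, 2, 1, 2, 3, 4, 1, 1, 1, 2, 3, 4]
```

The interval context (2, 3, 4) before an \(a^\textrm{a}\) access occurred once before (at \(i = 7\), followed by \(I_7 = 1\)), so the sketch reproduces the prediction \(\hat{I}_{13} = 1\) from the text.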

5 TAGE Access Interval Predictor

A detailed description and analysis of the original TAGE-BP can be found in [5]. TAGE is a hardware approximation of PPM: instead of matching sequences of arbitrary length, only a few lengths are selected according to a geometric series, and instead of iterating through the different history lengths, a hash map is used for each selected length.

5.1 Design

The TAGE-AIP’s basic structure, as shown in Fig. 5, is similar to the TAGE-BP. It consists of \(n_\textrm{T}\) components, each component t utilizing a different history length
$$\begin{aligned} L(t) = {\left\{ \begin{array}{ll} \lceil \alpha^{t-1}L(1) - 0.5\rceil & \text{if } 1 \le t < n_\textrm{T},\\ 0 & \text{if } t = 0. \end{array}\right. } \end{aligned}$$
(4)
The spacing \(\alpha \) and the first length L(1) are chosen at design-time. For example, \(\alpha = 2.88\) and \(L(1) = 1\) lead to \(L(t) = (0, 1, 3, 8, 24, \ldots )\).
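Equation (4) can be checked with a few lines of code; the example values \(\alpha = 2.88\) and \(L(1) = 1\) reproduce the series above.

```python
import math

def history_lengths(n_t, alpha, l1):
    """History length per component after Eq. (4): L(0) = 0 and
    L(t) = ceil(alpha**(t-1) * L(1) - 0.5) for 1 <= t < n_t."""
    return [0] + [math.ceil(alpha ** (t - 1) * l1 - 0.5) for t in range(1, n_t)]

# alpha = 2.88, L(1) = 1 yields the example series (0, 1, 3, 8, 24).
lengths = history_lengths(5, 2.88, 1)
```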
The base table with \(n_\textrm{I}\) entries provides the fallback prediction when no tag matches. It employs only the address and no past intervals. Each interval \(I_i\) is stored at the address generated with a simple hash \(h(a_i)\) (e.g., bit slicing) of the access identifier. When the access \(a_j = a_i\) (or another for which \(h(a_j) = h(a_i)\)) occurs again, the base table emits \(\hat{I}_j(0) = I_i\).
The other tables consist of a memory bank with \(n_\textrm{I}\) entries from which the prediction information \(\iota _i(t)\), a tag and the useful counter u of bit-widths \(w_\mathrm {\iota }\), \(w_\textrm{t}\) and \(w_\textrm{u}\), respectively, are read. The subprediction \(\hat{I}_i(t)\) of width \(w_\textrm{I}\) is computed from \(\iota _i(t)\) by means of the subpredictor \(\Pi \). In contrast to the TAGE-BP, we consider a variety of subpredictors (c.f. section 5.3).
The table is addressed with a \(\log _2 n_\textrm{I}\)-bit wide hash \(h_\textrm{a}\) of the observation window \(C_i(L(t))\).
A different hash function \(h_\textrm{t} \ne h_\textrm{a}\) produces a tag from the same input \(C_i(L(t))\) to detect hash collisions of \(h_\textrm{a}\). Both \(h_\textrm{a}\) and \(h_\textrm{t}\) are based on circular shift registers similar to [17]. Only subpredictions for which the tag matches are considered. Of these, the component with the highest L(t) provides the final prediction \(\hat{I}_i\). The useful counter u is used as in [5] to keep track of entries that can be overwritten because they provide the same prediction as another entry with a shorter context. This is akin to the removal or deactivation of redundant nodes in the PPM suffix tree detailed in Sect. 4.
\(\hat{I}_i\) is finally compared with a cycle counter which is reset at every access. The predictor sets pred when the next cycle is \(\hat{I}_i\) cycles after the last access.
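The selection logic just described can be sketched as follows. The hash function below is a toy CRC-based stand-in for the circular-shift-register hashes \(h_\textrm{a}\) and \(h_\textrm{t}\), and the table sizes and bit-widths are illustrative, not the paper's RTL.

```python
import zlib

def h(window, bits):
    """Toy stand-in for the circular-shift-register hashes h_a and h_t."""
    return zlib.crc32(repr(window).encode()) % (1 << bits)

def tage_predict(base_pred, tables, windows, idx_bits=10, tag_bits=9):
    """Final prediction selection: among the tagged components whose stored
    tag matches, the one with the longest history wins; the tagless base
    table provides the fallback. tables[t] maps a table index to a
    (tag, prediction) entry; windows[t] is the observation window C_i(L(t)),
    ordered from the shortest to the longest history."""
    pred = base_pred
    for table, window in zip(tables, windows):
        entry = table.get(h(window, idx_bits))
        if entry is not None and entry[0] == h(("tag", window), tag_bits):
            pred = entry[1]  # a longer matching history overrides
    return pred

# One tagged component storing prediction 5 for the window (3, 4, 7):
w = (3, 4, 7)
table = {h(w, 10): (h(("tag", w), 9), 5)}
```

With a matching tagged entry the component's subprediction wins; with an empty table the base prediction is used.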

5.2 Interval Representation

In branch prediction, the event to be predicted, the branch outcome, has only two possible values, so a single bit suffices for representation. Interval representation requires more bits. In theory, the set of possible intervals \(\mathbb {N}^+\) is infinite. However, large intervals occur rarely, as can be seen in Fig. 6, which shows the probability P(I) of different intervals during the execution of selected MiBench programs. For example, only 6 % of intervals exceed \(I = 16\) and only 0.1 % exceed \(I = 32\).
When \(w_\textrm{I}\) bits are used, intervals up to length \(I_\textrm{max}= 2^{w_\textrm{I}}\) are encoded correctly, whereas overflowing intervals \(I > I_\textrm{max}\) are aliased to \(I \bmod I_\textrm{max}\). On the other hand, a larger \(w_\textrm{I}\) generally leads to a higher \(w_\mathrm {\iota }\) and thus imposes higher memory requirements. The storage requirement in bits for prediction information, tags, and useful counters is
$$\begin{aligned} M = n_\textrm{T}n_\textrm{I}w_\mathrm {\iota }+ (n_\textrm{T}-1)n_\textrm{I}(w_\textrm{u}+ w_\textrm{t}). \end{aligned}$$
(5)
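Equation (5) translates directly to code. The parameter values in the example call are illustrative; in particular, \(w_\mathrm{\iota }\) depends on the chosen subpredictor and is an assumption here.

```python
def tage_memory_bits(n_t, n_i, w_iota, w_u, w_t):
    # Eq. (5): all n_t components store prediction information, but only the
    # n_t - 1 tagged components additionally store a useful counter and a tag.
    return n_t * n_i * w_iota + (n_t - 1) * n_i * (w_u + w_t)

# Illustrative: n_T = 5, n_I = 1024, w_u = 2, w_t = 9 (cf. Table 3) and an
# assumed w_iota = 6 give 75 776 bits of storage.
m_bits = tage_memory_bits(5, 1024, 6, 2, 9)
```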

5.3 Subprediction and Update Method

Once an entry is allocated in a table at the address determined by \(h_\textrm{a}(C_k(L(t)))\), it is possible to store arbitrary information \(\iota (t)\) about the outcomes of the corresponding context. Based on this information, a subprediction \(\hat{I}_i(t) = \Pi (\iota (t))\) is computed whenever the current context matches \(C_i(L(t)) = C_k(L(t))\). Whenever the subprediction is used as the final prediction, the entry is updated.
The TAGE-BP stores a saturating counter that is incremented when the branch is taken and decremented otherwise. This method effectively tracks which of the two possible outcomes happened the most often and selects it.
An equivalent for AIP is the maximum-likelihood (ML) subprediction method, which is similar to [3]. It can be realized by storing a counter for each possible interval and selecting the interval with the highest counter. The associated memory complexity is of the order \(\mathcal {O}(2^{w_\textrm{I}})\). With the help of hashes and truncation, the memory requirement might be reduced at the cost of degraded performance.
Instead, we evaluate four approaches with linear memory complexity:
1.
AVG, where the prediction is the moving average of the last \(n=2\) outcomes of \(C_i(L(t))\).
 
2.
EXP1, where exponential smoothing is used with a smoothing factor \(\beta = 0.5\). The new prediction is the average of the last prediction and the actual interval \(\hat{I}_i(t):= \lfloor \beta I_i + (1-\beta )\hat{I}_i(t)\rfloor \).
 
3.
EXP2, which is similar to EXP1 but with a smoothing factor of \(\beta = 0.75\). The emphasis on newer intervals is therefore higher than in EXP1.
 
4.
KEEP, where a prediction is kept for the entire existence of an entry until it is overwritten. If a new entry is allocated for \(C_l(L(t))\) with the outcome \(I_l\), then the table predicts \(\hat{I}_i = I_l\) whenever \(h_\textrm{a}(C_i(L(t))) = h_\textrm{a}(C_l(L(t)))\).
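The update rules above can be sketched as follows (integer floor as in the EXP formula; KEEP needs no update logic, as it simply retains the interval stored when the entry was allocated).

```python
import math

def update_avg(history, interval, n=2):
    """AVG: predict the (floored) average of the last n = 2 observed
    intervals. Returns the updated history and the new prediction."""
    history = (history + [interval])[-n:]
    return history, sum(history) // len(history)

def update_exp(pred, interval, beta):
    """EXP1 (beta = 0.5) / EXP2 (beta = 0.75): exponential smoothing,
    new prediction = floor(beta * I_i + (1 - beta) * old prediction)."""
    return math.floor(beta * interval + (1 - beta) * pred)
```

For example, with an old prediction of 2 and a new interval of 4, EXP1 moves to 3, while EXP2 would weight the new interval more strongly.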
 

6 Performance Evaluation

Table 2
Overview of the access traces

Program      | Variant | \(n_\textrm{access}\) | \(K_\textrm{end}\)
basicmath    | “tiny”  | 9 565 757  | 120 790 227
bitcount     | “small” | 1 846 690  | 38 840 684
dijkstra     | “tiny”  | 3 971 887  | 20 154 946
fft          | “tiny”  | 4 997 606  | 29 598 468
gsm          | “small” | 5 750 522  | 28 265 373
jpeg         | “small” | 8 631 635  | 40 778 800
qsort        | “small” | 5 082 398  | 23 157 456
sha          | “small” | 2 109 141  | 17 489 771
stringsearch | “large” | 1 654 440  | 7 433 104
susan        | “small” | 4 951 145  | 35 398 006
We executed selected programs of the MiBench [7] embedded benchmark suite on the simple RISC-V processor model from gem5 [19] (TimingSimpleCPU) and traced access, adr, and pc during execution. MiBench offers a “small” and a “large” test case for each program. We usually opted for the “small” variant but in some cases chose the “large” one or our own reduced variant, dubbed “tiny”, to keep the number of accesses of each trace in roughly the same order of magnitude. Table 2 documents the selected programs, variants, access counts, and program lengths. Technical limitations prevented the use of the entire MiBench suite.
For performance evaluation, we set up a simulation framework as in Fig. 7. A master simulator sets access, adr, and pc according to the traces. These signals are fed into the predictor-under-evaluation. A tracker compares access and pred and records \(n_\textrm{hit}\) and \(n_\textrm{fp}\).

6.1 Prediction by Partial Matching

The performance of PPM with \(N = 50\) for different programs and for different access identifiers (cf. Sect. 4) is shown in Fig. 8. Identifying an access by its program counter rather than its address results in more accurate PPM predictions for most programs. This is especially true for data-sensitive programs like the sorting algorithm “qsort”, where the access address often depends on the data content; address-based prediction performs poorly on these kinds of programs. On the other hand, address-based identification works slightly better for data-oblivious programs such as the hashing algorithm “sha”. The combination of both fares worse than the program counter alone, even though more information is employed. This suggests that the program counter generalizes well. On average, program-counter-based PPM achieves a hit ratio of \(P_\textrm{hit}= 99 \%\).
While PPM is not feasible for hardware implementation in typical embedded systems, its results may serve as a de facto limit of prediction accuracy. TAGE approximates PPM with practical adjustments and thus seems like a promising candidate for a highly accurate access interval predictor. When designing the TAGE predictor, one needs to minimize the degradation caused by these adjustments.

6.2 Tagged Geometric History Length

Figure 9 shows that TAGE-AIP performs slightly worse than PPM. Programs that PPM predicts more accurately are generally also predicted more accurately by TAGE-AIP, with “susan” as a notable exception. Overall, \(P_\textrm{hit}\) decreases by 2 % from PPM to TAGE-AIP.
Access Identification
Just like PPM, TAGE prediction benefits from program-counter-based access identification. The average hit rate is around 2 % higher than with \(a_i = A_i\) and around 3 % higher than with \(a_i = (A_i, P_i)\).
Interval Representation
Figure 10 illustrates the average miss rate and interval utilization for different \(w_\textrm{I}\) values, as well as the associated memory requirements. If \(w_\textrm{I}\) is too small, too many collisions occur and the predictor is unable to properly record and recall interval histories. Prediction performance, in terms of \(P_\textrm{miss}\) and \(U_\textrm{I}\), improves with \(w_\textrm{I}\) until a saturation point is reached at \(w_\textrm{I}= 6\), as almost none of the recorded traces exhibit intervals \(I > 2^6 = 64\).
Subprediction and Update Method
There is a noticeable trade-off between prediction accuracy and memory requirements. The method with the lowest \(P_\textrm{miss}\), ML, needs the most memory. AVG requires less memory but prediction accuracy is slightly impaired. EXP2 is slightly behind AVG but with a reduced memory footprint. Finally, EXP1 and KEEP are the most inaccurate. \(U_\textrm{I}\) is nearly the same for all methods.
We choose \(w_\textrm{I}= 6\) and EXP2 as this combination yields the lowest product \(MP_\textrm{miss}\). The other parameters were determined by random search within the ranges specified in Table 3. More sophisticated design space exploration (DSE) is outside the scope of this paper. The selected configuration achieves \(P_\textrm{hit}= 97.0\ \%\) and \(U_\textrm{I}= 99.5\ \%\) with \(M = 9.75\) kB of memory.
Table 3
Parameters of the TAGE access interval predictor

Parameter        | Value | Range
\(n_\textrm{T}\) | 5     | 1–15
L(1)             | 1     | 1–50
\(\alpha \)      | 2.88  | 0.5–50
\(w_\textrm{t}\) | 9     | 4–20
\(w_\textrm{u}\) | 2     | 1–5
\(n_\textrm{I}\) | 1024  | 2–4096
\(w_\textrm{I}\) | 6     | 2–8
\(\Pi \)         | EXP2  | ML, AVG, KEEP, EXP1, EXP2
\(a_i\)          | \(P_i\) | \(A_i\), \(P_i\), \((A_i, P_i)\)
Table 4
Comparison of different access interval predictors

Predictor           | \(P_\textrm{hit}\) [%] | \(U_\textrm{I}\) [%] | M [kB]
CBE                 | 34.7 | 91.3 | 0
ALB (1024 entries)  | 51.4 | 92.9 | 0.63
ALB (16384 entries) | 56.4 | 93.6 | 10.00
PPM                 | 99.0 | 99.9 | Not applicable
TAGE-AIP            | 97.0 | 99.6 | 9.75

6.3 Performance Comparison

A comparison between different access interval predictors is drawn in Table 4. We re-implemented the ALB and the simple counter-based estimator (CBE) from [2] and stimulated them with the same traces as PPM and TAGE-AIP. TAGE-AIP predicts more accurately than the ALB and the CBE, which does not rely on SRAM blocks. The improved hit rate comes at the price of higher memory complexity. Therefore, TAGE may be used where a high prediction accuracy is more important than a low predictor overhead. PPM is the most accurate of all but not feasible for hardware implementation: our software implementation, written in Python, takes up to 5 GB of random-access memory (RAM) for the suffix tree. For reference, the TAGE software implementation consumes around 17 MB of RAM.

7 Conclusion and Outlook

In this paper, we derived access interval predictors from the state-of-the-art branch prediction methods PPM and TAGE-BP. We explored the adjustments that are needed to predict intervals instead of branch outcomes. Our first subject of interest was PPM as a theoretical precursor of the TAGE-BP. It was modified to work on sequences of access intervals and identifiers instead of branch outcomes and addresses. PPM was capable of predicting 99 % of access intervals. The results indicate that, for the purpose of AIP, an access is better identified by its program counter than by its address as in prior AIP research [2, 3, 16].
Starting from PPM, the TAGE-AIP makes several compromises to lower implementation complexity to a reasonable level at the cost of a slightly degraded hit rate. For example, intervals need to be represented with a finite number of bits. In addition, this number is higher than in branch prediction, where one bit suffices to encode all events. This raises memory requirements but does not hinder accurate predictions. In our case, 5–6 bits are enough for interval representation as intervals exceeding 32 or 64 cycles are rare.
Furthermore, different methods of storing interval information and computing a prediction from it were discussed. Keeping track of the frequency of each possible outcome, as is done in branch prediction with means of saturating up-down counters, is too memory-expensive in the case of AIP. Therefore we investigated alternatives of which a moving average of the last outcomes and exponential smoothing proved to be efficient.
The resulting TAGE-AIP reaches a hit rate of over 97 % while being implementable with moderate hardware requirements of 9.75 kB of SRAM. It offers many parameters which may be tuned to achieve the optimal trade-off between accuracy and cost required by the AIP use case at hand. Prediction for computing platforms not covered in this paper, e.g., ones based on x86, ARM, or Xtensa, may be better served by different TAGE-AIP configurations.
Finally, we made a comparison to other access interval predictors found in literature. Our proposed predictor achieves higher prediction accuracy than any other implementable predictor at the price of higher memory complexity.
It should be noted that while the TAGE-BP is known to be precise, it also has a high update cost. Therefore, further research will address the register transfer logic design of the proposed TAGE-AIP and investigate the timing overhead. Another open topic is the performance of adaptations of other popular branch predictors, such as gshare, GEHL, or Perceptron [4].

Acknowledgements

V.R. wrote the main manuscript. All authors reviewed the manuscript. The authors declare no conflict of interest. This work was funded in part by the German Federal Ministry of Education and Research (BMBF) in the project “E4C” (project number 16ME0426K). We thank the Center for Information Services and High Performance Computing (ZIH) at TU Dresden for generous allocation of compute resources.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References
1. Haas, S., Seifert, T., Nöthen, B., Scholze, S., Höppner, S., Dixius, A., Pérez Adeva, E., Augustin, T., Pauls, F., Moriam, S., Hasler, M., Fischer, E., Chen, Y., Matúš, E., Ellguth, G., Hartmann, S., Schiefer, S., Cederström, L., Walter, D., Henker, S., Hänzsche, S., Uhlig, J., Eisenreich, H., Weithoffer, S., Wehn, N., Schüffny, R., Mayr, C., Fettweis, G.: A Heterogeneous SDR MPSoC in 28 nm CMOS for Low-Latency Wireless Applications. In: DAC. ACM Press, New York (2017). https://doi.org/10.1145/3061639.3062188
2. Wittig, R., Pauls, F., Matus, E., Fettweis, G.: Access Interval Prediction for Tightly Coupled Memory Systems. In: SAMOS. Springer, Cham (2019)
4. Mittal, S.: A survey of techniques for dynamic branch prediction. Concurrency Comput.: Pract. Exp. 31(1), 1–36 (2018)
5. Seznec, A., Michaud, P.: A case for (partially) TAgged GEometric history length branch prediction. J. Instr.-Level Parallelism 8, 23 (2006)
6. Cleary, J., Witten, I.: Data compression using adaptive coding and partial string matching. IEEE Trans. Commun. 32(4), 396–402 (1984)
7. Guthaus, M.R., Ringenberg, J.S., Ernst, D., Austin, T.M., Mudge, T., Brown, R.B.: MiBench: A free, commercially representative embedded benchmark suite. In: 4th Annu. Workshop on Workload Characterization. IEEE (2001)
8. Schoeberl, M., Abbaspour, S., Akesson, B., Audsley, N., Capasso, R., Garside, J., Goossens, K., Goossens, S., Hansen, S., Heckmann, R., Hepp, S., Huber, B., Jordan, A., Kasapki, E., Knoop, J., Li, Y., Prokesch, D., Puffitsch, W., Puschner, P., Rocha, A., Silva, C., Sparsø, J., Tocchi, A.: T-CREST: Time-predictable multi-core architecture for embedded systems. J. Syst. Architect. 61(9), 449–471 (2015). https://doi.org/10.1016/j.sysarc.2015.04.002
16. Razilov, V., Wittig, R., Matúš, E., Fettweis, G.: Tagged Geometric History Length Access Interval Prediction for Tightly Coupled Memory Systems. In: SAMOS. Springer, Cham (2022)
17. Michaud, P.: A PPM-like, tag-based branch predictor. J. Instr.-Level Parallelism 7, 10 (2005)
18. Michaud, P.: Analysis of a tag-based branch predictor. Research Report RR-5366, INRIA (2004)
Metadata
Title: Access Interval Prediction by Partial Matching for Tightly Coupled Memory Systems
Authors: Viktor Razilov, Robert Wittig, Emil Matúš, Gerhard Fettweis
Publication date: 13.02.2024
Publisher: Springer US
Published in: International Journal of Parallel Programming, Issue 1-2/2024
Print ISSN: 0885-7458
Electronic ISSN: 1573-7640
DOI: https://doi.org/10.1007/s10766-024-00764-1
