research-article

Open Access

Register-Pressure-Aware Instruction Scheduling Using Ant Colony Optimization

Authors:
Ghassan Shobaki

California State University, Sacramento, CA

California State University, Sacramento, CA

0000-0001-8727-671X
View Profile

,
Vahl Scott Gordon

California State University, Sacramento, CA

California State University, Sacramento, CA
View Profile

,
Paul McHugh

California State University, Sacramento, CA

California State University, Sacramento, CA
View Profile

,
Theodore Dubois

California State University, Sacramento, CA

California State University, Sacramento, CA
View Profile

,
Austin Kerbow

California State University, Sacramento, CA

California State University, Sacramento, CA
View Profile

ACM Transactions on Architecture and Code Optimization Volume 19 Issue 2Article No.: 23pp 1–23https://doi.org/10.1145/3505558

Published:31 January 2022Publication History

ACM Transactions on Architecture and Code Optimization

Abstract

This paper describes a new approach to register-pressure-aware instruction scheduling, using Ant Colony Optimization (ACO). ACO is a nature-inspired optimization technique that researchers have successfully applied to NP-hard sequencing problems like the Traveling Salesman Problem (TSP) and its derivatives. In this work, we describe an ACO algorithm for solving the long-standing compiler optimization problem of balancing Instruction-Level Parallelism (ILP) and Register Pressure (RP) in pre-allocation instruction scheduling. Three different cost functions are studied for estimating RP during instruction scheduling. The proposed ACO algorithm is implemented in the LLVM open-source compiler, and its performance is evaluated experimentally on three different machines with three different instruction-set architectures: Intel x86, ARM, and AMD GPU. The proposed ACO algorithm is compared to an exact Branch-and-Bound (B&B) algorithm proposed in previous work. On x86 and ARM, both algorithms are evaluated relative to LLVM's generic scheduler, while on the AMD GPU, the algorithms are evaluated relative to AMD's production scheduler. The experimental results show that using SPECrate 2017 Floating Point, the proposed algorithm gives geometric-mean improvements of 1.13% and 1.25% in execution speed on x86 and ARM, respectively, relative to the LLVM scheduler. Using PlaidML on an AMD GPU, it gives a geometric-mean improvement of 7.14% in execution speed relative to the AMD scheduler. The proposed ACO algorithm gives approximately the same execution-time results as the B&B algorithm, with each algorithm outperforming the other on a substantial number of hard scheduling regions. ACO gives better results than B&B on many large instances that B&B times out on. Both ACO and B&B outperform the LLVM algorithm on the CPU and the AMD algorithm on the GPU.

1 INTRODUCTION

Register allocation and instruction scheduling are two fundamentally important compiler optimizations. In most production compilers, register allocation and instruction scheduling are done in two different passes, because doing them simultaneously in one pass would be too complex.

Instruction scheduling and register allocation are closely related, because the instruction order computed in the pre-allocation instruction scheduling pass determines the register pressure (RP), which is the number of virtual registers that have overlapping live ranges and must be assigned to different physical registers. RP reflects the demand for physical registers. If the demand for registers exceeds the number of physical registers on the target machine, the register allocator must spill some virtual registers to main memory by adding load and store instructions (spill code) that may slow the program. On a Graphics Processing Unit (GPU), spilling is rare and extremely expensive. However, RP determines the GPU occupancy, which is the number of thread groups that can be executed in parallel. When each thread uses fewer registers, the GPU can run more threads in parallel. Occupancy usually has a high impact on the execution time of a GPU program. Therefore, the impact of RP reduction on the performance of a GPU program is generally greater than its impact on the performance of a CPU program.

Minimizing RP is not the only objective of pre-allocation instruction scheduling. In fact, the original objective of instruction scheduling is exploiting Instruction-Level Parallelism (ILP). ILP is exploited by executing independent instructions in parallel to minimize the schedule length, but this tends to increase RP, as more registers are needed to hold the results of the instructions that are executed in parallel. Thus, maximizing ILP and minimizing RP are two conflicting objectives that must be balanced in pre-allocation scheduling.

Compiler scheduling for ILP has a particularly high impact on the performance of in-order processors. GPUs do not reorder instructions within a single thread at run time. Therefore, the impact of compiler scheduling for ILP on GPU performance is generally higher than its impact on the performance of a modern out-of-order CPU.

The problem of balancing ILP and RP in pre-allocation instruction scheduling is a fundamental open problem in code generation and optimization. Even optimizing one of these two conflicting objectives (ILP or RP) is NP-hard [Cooper and Torczon 2011]. Current production compilers solve this problem using heuristics (usually greedy heuristics). However, recent research on both CPUs [Lozano et al. 2019, Shobaki et al. 2019] and GPUs [Rawat et al. 2018, Shobaki et al. 2020] has shown that these heuristics may produce sub-optimal results that significantly degrade performance. On GPUs, pre-allocation instruction scheduling is particularly important, because both RP and ILP significantly impact the execution time.

In the operations research (OR) field, researchers have successfully computed precise, and often exact, solutions to NP-hard problems using intelligent search techniques, including Branch-and-Bound (B&B), Constraint Programming (CPR), and Ant Colony Optimization (ACO). Despite the success of these techniques in OR, applying such computationally expensive techniques to NP-hard compiler optimization problems was impractical in the past. However, today's powerful computing has motivated some researchers to explore applying some of these techniques to NP-hard problems in code optimization [Domagala et al. 2016, Lozano et al. 2018 and 2019, Shobaki et al. 2019 and 2020]. The results of this recent research show that applying such techniques can significantly improve performance in some cases and that the increase in compile time can be controlled by applying them selectively to the hot code and setting reasonable time limits [Shobaki et al. 2013 and 2019].

In this paper, we explore applying ACO to the register-pressure-aware instruction scheduling problem in compilers, and we apply it to both CPU and GPU targets. ACO is a population-based optimization technique inspired from nature. Other population-based techniques include genetic algorithms, genetic programming, and particle swarm optimization. Ants in nature find short paths between a food source and their nest by depositing pheromones as they carry food. The act of ants following the pheromones while occasionally straying from the pheromone trail and the natural dissipation of pheromones cause the trail over time to shorten and approach optimality. This phenomenon inspired a class of ant-based algorithms for finding optimal solutions to NP-hard optimization problems. As detailed in Section 4, a pheromone table is used in an ACO algorithm to simulate the deposition and dissipation of pheromones.

ACO was introduced by Dorigo and Gambardella [1997] to compute precise solutions to large instances of the Traveling Salesman Problem (TSP), which is a well-known NP-hard problem. In later research, the technique was applied to a number of related problems such as job-shop scheduling [Martens et al. 2007], protein folding [Hu et al. 2008], image processing [Jevtić 2009], and many others.

The ACO algorithm proposed in the current paper is based on the Ant Colony System (ACS) described by Gambardella and Dorigo [2000] for solving the Sequential Ordering Problem (SOP), which is a generalization of the TSP. In the SOP, the objective is finding a node permutation that minimizes a path length without violating a given set of precedence constraints [Escudero 1988]. Our proposed ACO algorithm capitalizes on the similarity between the instruction scheduling problem and the SOP. The objective in both problems is finding a minimum-cost sequence that satisfies certain precedence constraints. To the best of our knowledge, our work is the first attempt to apply ACO to the register-pressure-aware instruction scheduling problem in compilers.

To apply ACO to the compiler instruction scheduling problem, the problem must be formulated as a combinatorial optimization problem with an explicit cost function. In previous work, we explored two different approaches to this two-objective optimization problem. The first approach is a single-pass approach in which the objective is minimizing a weighted sum of the schedule length and the RP cost. The second approach is a two-pass approach in which RP is treated as a primary objective that is minimized in the first pass, while schedule length is treated as a secondary objective that is minimized in the second pass. As explained in previous work [Shobaki et al. 2020], a two-pass approach is more effective on a GPU target, because optimizing occupancy is critically important and the two-pass approach ensures that enough time is spent searching for the best occupancy. In the current paper, we use the single-pass approach for CPU targets and the two-pass approach for the GPU target.

In previous work, we introduced multiple cost functions for estimating RP during instruction scheduling, including the Peak Excess Register Pressure (PERP) [Shobaki et al. 2013] and the Sum of Live Interval lengths (SLIL) [Shobaki et al. 2019] for CPU targets, and the Adjusted Peak Register Pressure (APRP) [Shobaki et al. 2020] for GPU targets. In this paper, we use all three cost functions. RP cost functions are summarized in Section 2.

The proposed ACO algorithm is implemented in the LLVM compiler [Lattner 2004] and its performance is evaluated on three different machines with three different instruction-set architectures: Intel x86, ARM, and AMD GPU. The proposed ACO algorithm is compared to our exact branch-and-bound algorithm [Shobaki et al. 2019, 2020]. On x86 and ARM, both algorithms are evaluated relative to LLVM's generic scheduler using the SPEC CPU 2017 benchmarks [SPEC 2017], while on the AMD GPU, the algorithms are evaluated relative to AMD's production scheduler using the PlaidML benchmarks [PlaidML]. AMD's algorithm is well-tuned for the AMD GPU.

The experimental results show that using SPECrate 2017 Floating Point (FP2017 for short), the proposed algorithm gives geometric-mean improvements of 1.13% and 1.25% in execution speed on x86 and ARM, respectively, relative to the LLVM scheduler. Using PlaidML on an AMD GPU, it gives a geometric-mean improvement of 7.14% in execution speed relative to the AMD scheduler. The ACO algorithm gives approximately the same execution-time results as the B&B algorithm, with each algorithm outperforming the other on a substantial number of hard scheduling regions. ACO gives better results than B&B on many large instances that B&B times out on. Both the ACO algorithm and the B&B algorithm outperform the LLVM algorithm on the CPU targets and the AMD algorithm on the GPU target. An important advantage of the ACO algorithm is that it has a higher degree of parallelism and is thus more likely to benefit from parallelization on a massively parallel processor.

The rest of this paper is organized as follows. Section 2 defines the terms used in the paper and explains the preliminary concepts. Section 3 summarizes previous work. Section 4 describes the proposed algorithm. Section 5 presents the experimental results, and Section 6 summarizes the conclusions and outlines future work.

2 BACKGROUND

The problem addressed in this paper is pre-allocation instruction scheduling with RP taken into account. Scheduling is done within a basic block. A basic block is a straight-line piece of code with no branches out of it except at the end of the block and no branches into it except at the beginning of the block [Cooper and Torczon 2011]. The input to the instruction scheduler is a sequence of instructions with their dependencies represented by a data dependence graph (DDG). The output is a schedule, which is an assignment of instructions to machine cycles. The objective is finding a schedule that achieves the best possible balance between schedule length and RP. The schedule length is the number of cycles used in the schedule, and RP is modeled using one of the cost functions described below.

The number of cycles in the schedule depends on the machine model. Our implementation of the proposed algorithm supports a general machine model with an arbitrary number of functional units and issue slots per cycle and arbitrary latencies. It also supports both pipelined and un-pipelined instructions. The experimental results, however, were produced using a simple machine model. In this simple model, the processor can issue one instruction of any type in each cycle, but the model still captures instruction latencies, and this appears to be the most important factor that affects performance. In previous work, we experimented with more accurate machine models, and they did not seem to make a significant difference in terms of execution-time performance on the target processors that we experimented with.

In the pre-allocation scheduling phase, registers in the code are virtual registers. In certain special cases, the code may contain physical registers. Each register has a specific data type. Register pressure computation is based on the Def and Use sets of the scheduled instructions. The Def set of an instruction is the set of registers that are defined by that instruction, and the Use set is the set of registers that the instruction uses. Our algorithm and our implementation allow an instruction to have an arbitrary number of Defs and Uses. Given an instruction schedule, the register pressure for a given data type at a given point in the schedule is the number of registers of that type that are live at that point. A register is live at a given point in a schedule if it has been defined but at least one instruction that uses it has not been scheduled yet at that point.

In previous work, we used two different approaches to the pre-allocation scheduling problem: a single-pass approach [Shobaki et al. 2013, 2019] and a two-pass approach [Shobaki et al. 2020]. These approaches are described next.

2.1 Single-Pass Approach

In this approach, a weighted sum of schedule length and RP is optimized in a single pass [Shobaki et al. 2013]. Given a sequence of instructions, the objective is to find a schedule S that minimizes the following cost function: (1) \[\begin{equation} Cost\left( S \right) = \left| S \right| - {\rm{\ }}{L_s} + w( {P - {L_p}} ) \end{equation}\]where |S| is the schedule length, L_s is a lower bound on the schedule length, P is the RP cost, L_p is a lower bound on the RP cost, and w is the register pressure weight (RPW). The RPW parameter expresses the weight of RP relative to the schedule length. A tight lower bound on the schedule length may be computed using the algorithm of Langevin and Cerny [1996]. In this work, the single-pass approach is used in scheduling for CPU targets.

2.2 Two-Pass Approach

On a GPU target, minimizing RP maximizes occupancy, and maximizing occupancy generally has a higher impact on GPU performance than exploiting ILP (minimizing the schedule length). In theory, this can be captured in the single-pass approach by setting the RPW to a sufficiently large value. Experimentally, however, we found that an extremely high RPW results in a very slow algorithm that may not spend enough time minimizing RP [Shobaki et al. 2020]. Therefore, we introduced the two-pass approach in which occupancy is maximized (RP is minimized) in the first pass as a primary objective and ILP is maximized in the second pass as a secondary objective. In the second pass, the algorithm searches for a minimum-length schedule among all the schedules that maintain the best occupancy found in the first pass. The first pass is called the occupancy pass, and the second pass is called the ILP pass. As explained in detail in the original paper, the two-pass approach is more effective for a GPU target, because it ensures that adequate time is spent optimizing occupancy. In this work, we use the two-pass approach for the GPU target.

In the two-pass approach to scheduling for the GPU, the best occupancy found in the first pass is treated as a constraint in the second pass. So, in the second pass, the algorithm searches for the shortest possible schedule among all the schedules that satisfy that occupancy constraint, and any schedule that does not satisfy the occupancy constraint is treated as an invalid schedule.

2.3 Register-Pressure Cost Functions

In previous work, we explored multiple cost functions for representing RP during scheduling, including the Peak Excess Register Pressure (PERP) [Shobaki et al. 2013] and the Sum of Live Interval Lengths (SLIL) [Shobaki et al. 2019] for CPU targets and the Adjusted Peak Register Pressure (APRP) [Shobaki et al. 2020] for GPU targets. In this subsection, we briefly describe these cost functions. The details can be found in the original papers.

The Peak Register Pressure (PRP) of a given data type in a given schedule is the maximum value of that type's RP at any point in the schedule. The PERP of a given data type is the difference between that type's PRP and the number of available physical registers of that data type on the target machine.

Assuming that the code is in Static Single Assignment (SSA) form [Cooper and Torczon 2011], each virtual register in a given basic block has a live interval that consists of one definition and one or more uses. Therefore, each live interval has one defining instruction and one or more using instructions. The Live Interval Length (LIL) is the number of instructions in the instruction sequence that starts with the definition and ends with the last use. The SLIL is the sum of live interval lengths for all virtual registers in a given schedule. Since live interval overlapping makes live intervals longer, a larger SLIL indicates more overlapping among live intervals, and thus higher RP.

As explained in previous work [Shobaki et al. 2019], the SLIL cost function may capture live interval overlaps that are not captured by the PERP cost function. SLIL captures the overlaps among all intervals, while PERP captures only the overlaps that contribute to the peak pressure. In a high-pressure scheduling region, the peak-pressure point in the schedule is not the only point that will cause the register allocator to insert spill code. Therefore, minimizing SLIL is more likely to minimize spill code than minimizing PERP, especially in larger scheduling regions with multiple high-pressure segments.

On a GPU, multiple PRP values may give the same occupancy value. To account for this, we introduced the adjusted peak register pressure (APRP) step function for modeling occupancy during instruction scheduling. The APRP of a given PRP value x is the maximum PRP value that gives the same occupancy as x. For example, on the AMD GPU used in this work, a PRP of 24 vector general-purpose registers (VGPRs) or less gives the maximum occupancy of 10 wavefronts, while PRP values in the range [25–28] give an occupancy of 9 wavefronts (a wavefront is a group of GPU threads that must be executed in lockstep). Therefore, PRP values in the range [1–24] are mapped to an APRP of 24 and PRP values in the range [25–28] are mapped to an APRP of 28.

3 RELATED WORK

Compiler researchers have been studying instruction scheduling for many decades. Instruction scheduling for minimum register pressure, or the Minimum-Register Instruction Sequence (MRIS) problem, has been studied since 1970 by Sethi and Ullman [1970] who proposed an algorithm for finding an instruction order for computing an expression using the minimum number of registers when the DDG is a tree. However, a tree DDG is a special case with limited practical value, as the DDGs constructed in an optimizing compiler for real code are often not trees.

A more practical algorithm for balancing ILP and RP was proposed by Goodman and Hsu [1988]. That algorithm was heuristic-based with no guarantee of optimality. Other heuristic approaches were then proposed by Govindarajan et al. [2003], Touati [2005] and Barany and Krall [2013].

Over the past two decades, some researchers proposed combinatorial approaches (intelligent search techniques) that are guaranteed to give optimal solutions if they terminate within a given time limit. Kessler [1998] proposed a dynamic programming approach to both the MRIS problem and the general problem of balancing RP and ILP. Barany and Krall [2013] proposed an Integer Linear Programming solution to the MRIS problem. Malik [2008] used Constraint Programming (CPR) to solve the MRIS problem, and Domagala et al. [2016] used CPR to integrate RP-aware instruction scheduling and loop unrolling. Lozano et al. [2018] used CPR to solve the integrated instruction scheduling and register allocation problem. Lozano provides an excellent survey of combinatorial approaches to instruction scheduling and register allocation [Lozano 2019].

Shobaki et al. [2013, 2019, 2020] presented a B&B algorithm for solving the RP-aware instruction scheduling problem. In the experimental evaluation, we compare the proposed ACO algorithm with that B&B algorithm. B&B is an exact technique for solving combinatorial optimization problems by conducting an exhaustive search of the solution space. The exhaustive search may complete in reasonable time on many instances if powerful pruning techniques are used to prune non-promising sub-spaces as early as possible. Shobaki et al. successfully applied this technique to RP-aware instruction scheduling and showed that it can in some cases produce a significant performance improvement relative to the greedy heuristics used in production compilers. However, that B&B algorithm times out on thousands of instances in SEPC CPU2006.

Rawat et al. [2018] describe a source-level re-ordering algorithm to minimize RP for stencil computation on the GPU. Using a pattern-specific approach, they report speedups in the range of 1.22x to 2.43x for NVCC and 1.15x to 2.08x for LLVM. These results show the significant impact of RP-aware scheduling on GPU performance.

ACO was introduced in 1992 by Marco Dorigo in his doctoral thesis [Dorigo 1992] and later published (as Ant Colony System) in the first issue of IEEE Transactions on Evolutionary Computation [Dorigo and Gambardella 1997]. The technique was originally applied to the TSP. Our proposed algorithm is based on the ACO proposed by Gambardella and Dorigo for the SOP [2000]. Using an ACO approach to solve other scheduling problems was studied by Ferrandi et al. [2010] and Wang et al. [2007]. Both explored the use of ACO for operation and resource scheduling, a somewhat more general problem of scheduling resources in software systems. They both noted the applicability to compilers but did not actually implement their algorithms in compilers. Those works, however, focused on minimizing the schedule length (exploiting ILP) without considering RP. To our knowledge, our proposed algorithm is the first ACO algorithm for RP-aware instruction scheduling.

More direct applications of ACO to compiler optimization have been proposed by Lintzmayer et al. [2012] and de Souza Xavier et al. [2018]. Lintzmayer et al. used an ACO approach to do graph coloring register allocation, and de Souza Xavier et al. used ACO to do design space exploration. Both reported good results.

The pre-allocation scheduling problem addressed in this paper is a multi-objective optimization problem (MOOP). Researchers have used various approaches to MOOPs. A common approach is to compute a weighted sum of the different objectives, thus reducing the problem from a multi-objective problem to a single-objective problem. This approach has been used in a number of ACO applications [Alaya 2007, Infante 2010]. Another approach that has been used in ACO applications is the Tchebycheff weighted metric approach that is based on identifying an optimum for each objective, forming an idealized (utopian) solution and then finding the nearest feasible solution to the idealized solution [Mu 2019]. A third approach to MOOPs is searching for Pareto optimal solutions. A solution to a MOOP is pareto optimal if no objective can be improved without degrading at least one other objective.

Researchers have used multiple techniques to adapt ACO to solving MOOPs. Whereas some researchers used a single pheromone table as in a standard ACO algorithm, other researchers used multiple pheromone tables, usually one per objective [Gambardella 1999, Doerner 2004, Infante 2010]. Some algorithms keep track of the Pareto optimal solutions found [Liu 2019, Mu 2019], and others limit pheromone table updates to favor those solutions that are Pareto optimal [Alaya 2007]. Reviews of various ACO approaches to solving MOOPs can be found in the papers of Leguizamón [2010] and Ning [2019].

4 ALGORITHM DESCRIPTION

ACO is inspired by the way ants utilize pheromones to construct trails. In ACO, artificial ants traverse paths in a graph in a probabilistic manner. The quality of a generated path determines the amount of pheromone deposited on each of the arcs that constitute that path. Subsequent ants then select the sequence of nodes they visit based on the amount of pheromone on each possible arc as well as a problem-specific heuristic, which we call the guiding heuristic. It is also customary to simulate the natural gradual dissipation of pheromone.

4.1 ACO Applied to Instruction Scheduling

Our proposed algorithm is based on the ACO algorithm proposed by Gambardella and Dorigo [2000] for the SOP. This particular version of ACO was chosen because of the similarities between instruction scheduling and the SOP. In both problems, the input is a DDG representing precedence constraints, and the objective is finding an order that minimizes a certain cost function. However, the RP-aware instruction scheduling problem is more complex than the SOP, because it is a MOOP. In our single-pass approach, using a weighted sum of two objectives (the RP cost and the schedule length) adds more complexity to the problem, while in our two-pass approach, the additional complexity arises from treating RP as a constraint in the second pass. Another complexity in both approaches is the presence of latency constraints.

The algorithm generates a large number of candidate schedules. Each candidate schedule is a complete sequence of instructions for the given scheduling region. Whenever a better candidate schedule is found, it is saved as the best schedule found so far. The algorithm terminates when a certain number of iterations N have been performed without finding an improvement of the best schedule. This number of iterations is called the termination condition. The termination condition is used to control the amount of time given to the ACO algorithm to find a good solution. In the context of compiler instruction scheduling, the termination condition is used to control compile time. In the experimental evaluation, we show how the termination condition affects performance.

ACO uses a pheromone table to guide the construction of candidate schedules. The pheromone table is a two-dimensional table with n rows and n columns, where n is the number of instructions in the current scheduling region. For any two instructions i and j, the value τ_ij in the pheromone table is the amount of pheromone placed on the arc between instructions i and j. If j is a root node in the DDG, a pseudo-instruction i₀ is used in the pheromone table instead of i.

A candidate schedule is initially empty. It is built by selecting one instruction at a time until the schedule is complete. At each step during schedule construction, the next instruction is selected from the ready list, which is a list of the unscheduled instructions that have had all of their dependencies satisfied. The ready list is updated each time an instruction is added to the schedule, as scheduling an instruction may make some of its dependents ready.

The choice of the next instruction to select (i.e., the next link that an ant traverses) is done randomly, but with a bias that takes into account both the values in the pheromone table and the guiding heuristic. For each instruction i, η_i is the value of the guiding heuristic for i. η_i is encoded as a number in the range [1, 2] with larger values representing higher priorities. Our implementation supports multiple guiding heuristics as detailed in Section 4.2. Each instruction i in the ready list is assigned a score τ_{i =} τ_liη_i, where l is the previous instruction selected. Then the next instruction is selected using one of two methods:

First method with probability s/n: The next instruction is selected through biased random selection, where the probability of selecting an instruction from the ready list is proportional to its τ_i. In population-based optimization algorithms, this approach is commonly referred to as fitness-proportional selection.
Second method with probability 1-s/n: The next instruction is the instruction with the maximum τ_i.

In the above probabilities, s is an adjustable parameter used to control the balance between exploitation and exploration. The value s represents the average number of instructions selected through biased random selection (exploration) as opposed to strict pheromone-based selection (exploitation). For example, if n is 100 and s is 20, 20% of the selections will be made based on biased randomness (exploration) and 80% of the selections will be made based on the highest pheromone value (exploitation). In population-based optimization systems, exploration refers to using individuals in the population (in this case artificial ants) to examine previously untested possibilities, while exploitation refers to using individuals in the population to continue examining possible solutions that are similar to the better solutions that have been discovered so far. Experimentally, s = 10 for any n gave better results than any other setting that we tried.

Each iteration simulates a certain number of ants, and each ant generates a candidate schedule. As shown in the experimental evaluation, we experimented with multiple settings of the number of ants per iteration. At the end of each iteration, the pheromone table is updated based on the best schedule in that iteration (the iteration winner). In this update, the pheromone on each link (i, j) in the iteration's best schedule is incremented according to the following formula: (2) \[\begin{equation} {{\rm{\tau }}_{{\rm{ij}}}}{\rm{\ }} + = {\rm{\ max}}\left( {\left( {1 - \frac{{{C_{best}}}}{{k*{C_{heur}}}}} \right){\rm{\ }}\left( {{d_{max}}\ - {\rm{\ }}{d_{min}}} \right),{\rm{\ }}0} \right){\rm{\ }} + {\rm{\ }}{d_{min}} \end{equation}\]where C_best is the cost of the iteration's best schedule, C_heur is the cost of the initial heuristic schedule, d_min and d_max are the minimum and maximum amounts of pheromone that can be deposited, and k is a tuning parameter. Experimentally, the best results were obtained with k = 1.5, d_min = 1, and d_max = 6.

It is important to note here that in the second pass of the two-pass version of our algorithm, the iteration's best schedule is selected from the schedules that satisfy the occupancy constraint. If no schedule in a given iteration satisfies the occupancy constraint, no pheromone table update will take place at the end of that iteration.

In order to simulate the decay of pheromones over time, the following formula is applied to each link in the pheromone table at the end of each iteration: (3) \[\begin{equation} {{\rm{\tau }}_{{\rm{ij}}}}{\rm{\ }} = {\rm{\ min}}( {{\rm{max}}( {{{\rm{\tau }}_{{\rm{ij}}}}( {1 - \rho } ),{{\rm{\tau }}_{{\rm{min}}}}} ),{{\rm{\tau }}_{{\rm{max}}}}} ) \end{equation}\]where ρ is the decay rate, τ_min and τ_max are the minimum and maximum amounts of pheromone that a link in the table can have. Experimentally, the best results were obtained with ρ = 0.1, τ_min = 1, τ_max = 8.

4.2 Heuristics

In the proposed ACO algorithm, various heuristics are used to both construct the initial solution and guide the ACO search, including the Critical-Path (CP) heuristic [Cooper and Torczon 2011] and the Last Use Count (LUC) heuristic described in previous work [Shobaki et al. 2015].

The CP heuristic is a commonly used heuristic for minimizing schedule length (exploiting ILP). The CP of an instruction is the length of the longest path between the instruction and a leaf node in the DDG. Thus, the CP measures the length of the dependence chain below an instruction. When multiple instructions are ready, selecting the instruction with the longest dependence chain below it increases the chances of hiding long latencies and consequently minimizing the schedule length.

The LUC heuristic is an intuitive heuristic for minimizing register pressure. The LUC of an instruction is the number of live ranges that the instruction closes. When multiple instructions are ready, selecting the instruction that closes the maximum number of live ranges is likely to minimize register pressure.

Both the CP and LUC are greedy heuristics that are not guaranteed to produce optimal solutions, even if we consider a single objective (schedule length or RP). Of course, the problem of optimizing two conflicting objectives is much more complex than the problem of minimizing one objective.

LLVM's scheduling algorithm is a list scheduling algorithm [Cooper and Torczon 2011] that is based on maintaining a ready list of instructions and selecting the next instruction to add to the schedule according to certain priority schemes, including schemes that are similar to LUC and CP. The LLVM algorithm gives higher priority to reducing register pressure, and thus its behavior tends to be similar to that of the LUC heuristic, that is, it produces low-RP schedules that can be too long. The AMD algorithm is based on the LLVM algorithm. It extends the LLVM algorithm to produce better schedules for an AMD GPU by balancing RP and ILP. It also involves multiple AMD-GPU-specific enhancements.

5 EXPERIMENTAL RESULTS

The proposed ACO algorithm was implemented in the LLVM compiler as an alternative pre-allocation scheduler. In this section, we present the results of our experimental evaluation. The proposed ACO algorithm was compared to the exact B&B algorithm proposed in previous work [Shobaki et al. 2013, 2019]. To show the importance of RP awareness, the evaluation also includes the CP algorithm that only considers ILP and does not consider RP.

5.1 Experimental Setup

The ACO scheduler and the B&B scheduler were implemented in LLVM as alternative schedulers to LLVM's machine-level pre-allocation scheduler. Three target architectures are included in the evaluation: x86-64, ARM, and an AMD GPU. For x86 and ARM, both algorithms are evaluated relative to LLVM's generic scheduler, while for the AMD GPU, the algorithms are evaluated relative to AMD's production scheduling algorithm, which is an extension of the LLVM algorithm. For the CPU targets, the single-pass version of the algorithm was used, while for the GPU target, the two-pass version was used.

The hardware and software configurations and the benchmarks used for each target architecture are shown in Table 1. The benchmarks used in the evaluation are SPECrate 2017 Floating Point¹ (FP2017) for CPU targets and PlaidML for the GPU target. The clock speed for the ARM target was reduced to 600 MHz to avoid the thermal throttling that caused random variation in execution times. The −O3 optimization level was used in all tests. At this optimization level, LLVM invokes a global greedy register allocator.

Table 1.

Architecture	Processor	Benchmarks	LLVM Version	OS
X86	Intel(R) Core (TM) i9-9900X @ 3.50 GHz	FP2017	LLVM 7.0	Ubuntu 18.04.5
ARM	Broadcom BCM2711 Cortex A-72 @600 MHz (underclocked) Cross compilation was done on an Intel Core i7-7700K processor @ 4.2 GHz	FP2017	LLVM 7.0	Ubuntu 20.04.1 LTS
AMD GPU	AMD Radeon RX Vega 64 GPU @ 1.63 GHz Compilation was done on an AMD Ryzen Threadripper 1950X processor	PlaidML	roc-ocl-2.4.0	Ubuntu 18.04.5

STAT	FP2017	PlaidML
Number of benchmarks	12	13
Number of functions/kernels	66,422	3,814
Number of scheduling regions	1,123,793	16,682
Avg. scheduling regions per function	16.9	4.4
Avg. Instructions per scheduling region	7.1	49
Max. Instructions per scheduling region	6193	921

ALGORITHM	PERP	SLIL	APRP
B&B	2.38%	43.00%	17.94%
ACO	1.11%	39.93%	18.98%
CP	−18.96%	−140.76%	−23.15%

DESCRIPTION	COUNT	Percentage
1. Total instances passed to B&B and ACO	561,131
2. Optimal by both (easy)	511,491	91.2% of total
3. Hard instances	49,640	8.8% of total
4. Optimal by B&B only	37,738	76% of hard
5. B&B timeouts	11,902	24% of hard
6. B&B timeouts with ACO and B&B equal	815	7% of timeouts
7. B&B timeouts with B&B better	6,552	55% of timeouts
8. B&B timeouts with ACO better	4,535	38% of timeouts

STAT	VALUE
Total functions	66,422
Functions with spills	12,096 (18.2%)
Sum of minima (SOM)	256,738
Avg. spills per function	3.9
Avg. spills per spilling function	21.2

Algorithm	B&B	ACO
Average occupancy	7.79	7.82
%Improvement relative to AMD	6.6%	7.0%
Kernels with better occupancy	218 (5.7%)	339 (8.9%)

Algorithm	B&B	ACO
Average schedule length	150.4	165.2
%Improvement relative to AMD	6.65%	−2.88%
Regions with better schedule length (won)	4,200	318
Avg. size of regions won	116.8	230.4
Avg. win margin	74.4	208.4

Algorithm	B&B	ACO many ants
Average schedule length	150.4	151.9
%Improvement relative to AMD	6.65%	5.60%
Regions with better schedule length (won)	2,381	350
Avg. size of regions won	164.1	232.4
Avg. win margin	63.1	357.9

Benchmarks	LLVM	B&B	ACO	ACO many ants
X86 FP2017 (hot only)	1055	2040	1526	N/A
ARM FP2017 (hot only)	1016	2014	2565	N/A
AMD GPU PlaidML	256	496	588	37,239

Register-Pressure-Aware Instruction Scheduling Using Ant Colony Optimization

ACM Transactions on Architecture and Code Optimization

Abstract

1 INTRODUCTION

2 BACKGROUND

2.1 Single-Pass Approach

2.2 Two-Pass Approach

2.3 Register-Pressure Cost Functions

3 RELATED WORK

4 ALGORITHM DESCRIPTION

4.1 ACO Applied to Instruction Scheduling

4.2 Heuristics

5 EXPERIMENTAL RESULTS

5.1 Experimental Setup

5.2 Benchmark Statistics

5.3 Register-Pressure Cost

5.4 Spills Generated by the Register Allocator

5.5 GPU Occupancy and Schedule Length

5.6 Execution Times

5.7 Compile Times

6 CONCLUSIONS AND FUTURE WORK

Footnotes

REFERENCES

Cited By

Index Terms

Recommendations

Graph transformations for register-pressure-aware instruction scheduling

Preallocation instruction scheduling with register pressure minimization using a combinatorial optimization approach

Exploring an Alternative Cost Function for Combinatorial Register-Pressure-Aware Instruction Scheduling

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

HTML Format

Share this Publication link

Share on Social Media