Abstract
Balancing Instruction-Level Parallelism (ILP) and register pressure during preallocation instruction scheduling is a fundamentally important problem in code generation and optimization. The problem is known to be NP-complete. Many heuristic techniques have been proposed to solve this problem. However, due to the inherently conflicting requirements of maximizing ILP and minimizing register pressure, heuristic techniques may produce poor schedules in many cases. If such cases occur in hot code, significant performance degradation may result. A few combinatorial optimization approaches have also been proposed, but none of them has been shown to solve large real-world instances within reasonable time. This article presents the first combinatorial algorithm that is efficient enough to optimally solve large instances of this problem (basic blocks with hundreds of instructions) within a few seconds per instance. The proposed algorithm uses branch-and-bound enumeration with a number of powerful pruning techniques to efficiently search the solution space. The search is based on a cost function that incorporates schedule length and register pressure. An implementation of the proposed scheduling algorithm has been integrated into the LLVM Compiler and evaluated using SPEC CPU 2006. On x86-64, with a time limit of 10ms per instruction, it optimally schedules 79% of the hot basic blocks in FP2006. Another 19% of the blocks are not optimally scheduled but are improved in cost relative to LLVM's heuristic. This improves the execution time of some benchmarks by up to 21%, with a geometric-mean improvement of 2.4% across the entire benchmark suite. With the use of precise latency information, the geometric-mean improvement is increased to 2.8%.
- Barany, G. 2011. Register reuse scheduling. In Proceedings of the 9th Workshop on Optimizations for DSP and Embedded Systems (ODES'11).Google Scholar
- Barany, G. and Krall, A. 2013. Optimal and heuristic global code motion for minimal spilling. In Proceedings of the International Conference on Compiler Construction. Google ScholarDigital Library
- Berson, D., Gupta, R., and Soffa, M. 1993. URSA: A unified resource allocator for registers and functional units in VLIW architectures. In Proceedings of the IFIP Working Conference on Architectures and Compilation Techniques for Fine and Medium Grain Parallelism. 243--254. Google ScholarDigital Library
- Cooper, K. and Torczon, L. 2004. Engineering a Compiler. Morgan Kaufmann, San Fransisco, CA. Google ScholarDigital Library
- Faraboschi, P., Fisher, J., and Young, C. 2001. Instruction scheduling for instruction level parallel processors. Proc. IEEE 89, 11, 1638--1659.Google ScholarCross Ref
- Fog, A. 2012. The micro-architecture of INTEL, AMD and VIA cpus. An optimization guide for assembly programmers and compiler makers. http://www.agner.org/optimize/microarchitecture.pdf.Google Scholar
- Goodman, J. and Hsu, W. 1988. Code scheduling and register allocation in large basic blocks. In Proceedings of the International Conference on Supercomputing. Google ScholarDigital Library
- Govindarajan, R., Yang, H., Amaral, J., Zhang, C., and Gao, G. 2003. Minimum register instruction sequencing to reduce register spills in out-of-order. IEEE Trans. Comput. 52, 1, 4--20. Google ScholarDigital Library
- Havanki, W., Banerjia, S., and Conte, T. 1998. Treegion scheduling for wide-issue processors. In Proceedings of the 4th International Symposium on High-Performance Computer Architecture (HPCA'98). Google ScholarDigital Library
- Kessler, C. 1998. Scheduling expression DAGs for minimal register need. J. Comput. Lang. 24, 1, 33--53. Google ScholarDigital Library
- Langevin, M. and Cerny, E. 1996. A recursive technique for computing lower-bound performance of schedules. ACM Trans. Des. Autom. Electron. Syst. 1, 4, 443--456. Google ScholarDigital Library
- Malik, A. 2008. Constraint programming techniques for optimal instruction scheduling. Ph.D. thesis, University of Waterloo. https://cs.uwaterloo.ca/∼vanbeek/Publications/malik.pdf. Google ScholarDigital Library
- Rim, M. and Jain, R. 1994. Lower-bound performance estimation for the high-level synthesis scheduling problem. IEEE Trans. Comput.-Aid. Des. Integr. Circ. Syst. 13, 4, 451--458. Google ScholarDigital Library
- Shobaki, G. and Wilken, K. 2004. Optimal superblock scheduling using enumeration. In Proceedings of the 37th International Symposium on Microarchitecture. Google ScholarDigital Library
- Shobaki, G. 2006. Optimal global instruction scheduling using enumeration. Ph.D. dissertation, Department of Computer Science, UC Davis. http://www.cs.ucdavis.edu/research/tech-reports/2006/CSE-2006-19.pdf. Google ScholarDigital Library
- Shobaki, G., Wilken, K., and Heffernan, M. 2009. Optimal trace scheduling using enumeration. ACM Trans. Archit. Code Optim. 5, 4. Google ScholarDigital Library
- Touati, S. 2005. Register saturation in instruction-level parallelism. Int. J. Parallel Program. 33, 4, 393--449. Google ScholarDigital Library
- Weicker, R. and Henning, J. 2007. Subroutine profiling results for the CPU2006 benchmarks. ACM SIGARCH Comput. Archit. News 35, 1, 102--111. Google ScholarDigital Library
- Winkel, S. 2007. Optimal versus heuristic global code scheduling. In Proceedings of the 40th International Symposium on Microarchitecture. Google ScholarDigital Library
Index Terms
- Preallocation instruction scheduling with register pressure minimization using a combinatorial optimization approach
Recommendations
Combinatorial Register Allocation and Instruction Scheduling
This article introduces a combinatorial optimization approach to register allocation and instruction scheduling, two central compiler problems. Combinatorial optimization has the potential to solve these problems optimally and to exploit processor-...
Register-Pressure-Aware Instruction Scheduling Using Ant Colony Optimization
This paper describes a new approach to register-pressure-aware instruction scheduling, using Ant Colony Optimization (ACO). ACO is a nature-inspired optimization technique that researchers have successfully applied to NP-hard sequencing problems like the ...
Optimizing occupancy and ILP on the GPU using a combinatorial approach
CGO 2020: Proceedings of the 18th ACM/IEEE International Symposium on Code Generation and OptimizationThis paper presents the first general solution to the problem of optimizing both occupancy and Instruction-Level Parallelism (ILP) when compiling for a Graphics Processing Unit (GPU). Exploiting ILP (minimizing schedule length) requires using more ...
Comments