1 Introduction
- systematically study the state-of-the-art software optimization methods for parallel computing systems that use machine learning or meta-heuristics;
- classify the existing studies based on the software life-cycle activities (compile-time and run-time), target computing systems, optimization methods, and period of publication;
- discuss existing challenges and future research directions.
2 Research methodology
2.1 Research questions
2.2 Search and selection of literature
2.3 The focus and scope of the literature review (selection process)
- publications that investigate the use of machine learning or meta-heuristics for software optimization of parallel computing systems;
- publications that contribute to compile-time activities (code optimization and code generation) and run-time activities (scheduling and adaptation) of the software life-cycle;
2.4 Data extraction
# | Data item | Description |
---|---|---|
1 | Date | Date of the data extraction |
2 | Bibliographic reference | Author, Year, Title, Research Center, Venue |
3 | Type of article | Journal article, conference paper, workshop paper, book section |
4 | Problem, objectives, solution | What is the problem; what are the objectives of the study; how the proposed solution works? |
5 | Optimization Technique | Which Machine Learning or Meta-heuristic algorithm is used? |
6 | Considered features | The list of considered features used for optimization |
7 | Life-cycle Activity | Code Optimization, Code Generation, Scheduling, Adaptation? |
8 | Target architecture | Single/Multi-node system, Grid Computing, Cloud Computing |
9 | Findings and conclusions | What are the findings and conclusions? |
10 | Relevance | Relevance of the study in relation to the topic under consideration |
3 Taxonomy and terminology
3.1 Parallel computing systems
3.2 Software optimization approaches
3.2.1 Meta-heuristics
3.2.2 Machine learning
3.3 Software optimization at different software life-cycle activities
3.4 Classification based on architecture, software optimization approach, and life-cycle activity
[Table: classification of the reviewed studies by software optimization approach (machine learning vs. meta-heuristics), life-cycle activity (code optimization, code generation, scheduling, adaptation), and period of publication (2000–2005, 2006–2011, 2012–2017).]
4 Compile-time
4.1 Code optimization
References | Algorithm | Objectives | Features | On/Off-line |
---|---|---|---|---|
[69] | DT | Identify loops to unroll | Loop characteristics (# memory accesses; # arithmetic operations; # statements; # control statements; # iterations) | Off-line (sup.) |
[87] | NN, SVM | Select the most beneficial loop unroll factor | Loop characteristics (# floating point operations; # operands; # memory operations; critical path length; # iterations) | Off-line (sup.) |
[17] | RSI | Determine whether to apply instructions scheduling | Code-block characteristics (# instructions; # branches; # calls; # stores; # returns; int/float/sys_func_unit instructions) | Off-line (sup.) |
– | PSD | Determine the most effective compiler optimizations | Static program features (# basic blocks in a method; # normal/critical/abnormal CFG edges; # int/float operations) | Off-line (sup.) |
[57] | kNN | Determine the best partitioning strategy of irregular applications | Static program features (# basic blocks; # instructions; loop probability; branch probability; data dependency) | Off-line (sup.) |
[99] | NN | Determine the best partitioning strategy of streaming app. | Program features (pipeline depth; split-join width; pipeline/split-join work; # computations; # load/store ops) | Off-line (sup.) |
[95] | SVM | Determine whether parallelism is beneficial; select the best scheduling policy | Static program features (# instructions; # load/store; # branches; # iterations); dynamic program features (# data accesses; # instructions; # branches) | Off-line (sup.) |
[1] | IIDM; MM; NN | Reduce the number of required program evaluations in iterative compiler optimization; analyze program similarities | Program features (type of nested loop; loop bound; loop stride; # iterations; nest depth; # array references; # instructions; # load/store/compare/branch/divide/call/generic/array/memory copy/other instructions; int/float variables) | Off-line (sup.) |
[88] | GP | Tuning compiler heuristics | Hyper-block formation features; register allocation features; data pre-fetching features | Off-line (unsup.) |
[21] | GrA; GA; HC; RP | Tuning the compilation process through adaptive compilation | – | Off-line |
– | PRO | Tune generated code; determine the best compilation parameters | Architectural parameters (cache capacity; register capacity); application-specific parameters | On-line |
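As a concrete illustration of the supervised approaches in the table above, the following sketch trains a tiny k-nearest-neighbor classifier on static loop features to decide whether a loop should be unrolled. The feature vectors, labels, and query are invented for illustration; the surveyed studies use far richer feature sets derived from real compiler analyses.

```python
# Illustrative sketch: supervised prediction of a loop-unroll decision from
# static loop features, in the spirit of the approaches in the table above.
# All feature values and training labels below are hypothetical.

def distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn_predict(training, query, k=3):
    """Return the majority label among the k nearest training samples."""
    nearest = sorted(training, key=lambda s: distance(s[0], query))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)

# Features: (# memory accesses, # arithmetic ops, # statements, # iterations)
training = [
    ((2, 8, 4, 1000), "unroll"),
    ((1, 6, 3, 500),  "unroll"),
    ((9, 2, 12, 20),  "no-unroll"),
    ((8, 3, 10, 16),  "no-unroll"),
    ((2, 7, 5, 800),  "unroll"),
    ((7, 1, 9, 10),   "no-unroll"),
]

print(knn_predict(training, (2, 7, 4, 900)))  # prints "unroll"
```

In practice the training set would be gathered off-line by compiling and timing many loop variants, which is exactly the costly step several of the surveyed studies try to reduce.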
4.2 Code generation
- generate device-specific code from other code representations,
- generate multiple implementations of the same code, or
- automatically generate parallel code from sequential code.
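The second item, multi-versioning, can be sketched as follows: several implementations of the same computation are provided (here hand-written rather than generated), and one is selected at run-time based on a dynamic feature such as the data-set size. The threshold value is an assumption for illustration only.

```python
# Illustrative sketch of multi-versioning with run-time selection.
# The parallel_threshold value is hypothetical, not taken from the survey.
from concurrent.futures import ThreadPoolExecutor

def sum_squares_sequential(data):
    return sum(x * x for x in data)

def sum_squares_parallel(data, workers=4):
    # Split the input into chunks and process them in a thread pool.
    chunk = max(1, len(data) // workers)
    parts = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sum_squares_sequential, parts))

def sum_squares(data, parallel_threshold=100_000):
    """Select the implementation variant based on the input size."""
    if len(data) >= parallel_threshold:
        return sum_squares_parallel(data)
    return sum_squares_sequential(data)

print(sum_squares(list(range(10))))  # prints 285 (sequential variant chosen)
```

A drawback noted later in this section applies directly: keeping many versions of each parallel region can dramatically increase the size of the executable.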
References | Algorithm | Objectives | Features | On/Off-line |
---|---|---|---|---|
[7] | DT | Generate device-specific code from high-level code; map applications to accelerating devices. | Loop (kernel) characteristics (data precision, amount of computation performed and memory access characteristics) | Off-line (sup.) |
[19] | kNN | Generate multi-threaded loop versions; select the most suitable one at run-time | Static code features (loop nest depth, # arrays used); dynamic features (data set size) | Off-line (sup.) |
[31] | NB, SVM, MLP, CSDT, LR | Source-to-source transformation of data-parallel applications; predict the efficiency and select the suitable device | Static program features (outer/inner access/write; basic operations; ...); dynamic program features (data-to; data-from; ...) | Off-line (sup.) |
[5] | – | Enable writing multiple versions of algorithms and algorithmic choices at the language level; auto-tuning of the specified algorithmic choices; switch between the available algorithms during program execution | – | Off-line |
[58] | LR | Determine the optimal work distribution between the CPU and GPU | Runtime algorithm parameters (input size) and hardware configuration parameters | On-line |
[81] | – | Distribute data-parallel portions of a program across heterogeneous computing resources; | – | – |
[77] | LRPR | Determine the list of program method transformations that result in lower compilation time | General program features (# instructions; # load/store operations; # float operations); loop-based features (# loops types; # loop statements) | Off-line (sup.) |
4.3 Observations, challenges, and future directions
Method | Advantages |
---|---|
Machine learning | |
Decision trees | |
Support Vector Machines | Stephenson and Amarasinghe [87] report that SVM and NN can predict the optimal unroll factor for a given loop 65% of the time, or a near-optimal one 79% of the time. Tournavitis et al. [95] use SVM to decide whether to parallelize loop candidates and achieve 96% of the performance of hand-tuned code. Fonseca and Cabral [31] report 92% prediction accuracy when using SVM to decide whether kernels should be executed on the GPU or CPU
(k) Nearest Neighbor | Liu et al. [57] report a 5.41% performance improvement compared to hard-coded compiler optimizations. Wang and O’Boyle [99] use NN to predict the partitioning structure of applications and achieve up to \(1.9 \times \) speedup compared to the default partitioning strategy, which is 60% of the ideal one. Chen and Long [19] report that 87% of the highest performance improvement can be achieved using kNN
Rule Set Induction | Cavazos and Moss [17] use RSI to determine whether or not to apply instruction scheduling on a code block and report achieving 90% of the performance improvement of the schedule-always method
Regression-Based Algorithms | Pekhimenko and Brown [77] use regression techniques to determine the optimal heuristic parameters and report a twofold speedup of the compilation process while maintaining the same code quality
Decision Tables | Fonseca and Cabral [31] report up to 92% prediction accuracy of DT, NB, and MLP when deciding whether kernels should be executed on the GPU or CPU, which contributes to a \(65\times \) speedup over sequential Java programs
Predictive Search Distribution | |
Meta-heuristics | |
Greedy Algorithm and Hill Climbing | Cooper et al. [21] use meta-heuristics to find the optimal compilation parameters while reducing the number of evaluations during search-space exploration from 10,000 to a single one, using profiling data and estimated virtual execution
Genetic Algorithm | Sandrieser et al. [88] obtain a speedup of 23% for hyper-block formation
Parallel Rank Ordering | |
References | Focus | Limitations |
---|---|---|
– | Single aspects of code optimization | Control single/few and simple optimizations (such as loop unrolling, instruction scheduling, hyper-block formation). Considering multiple and more complex compiler optimizations is more challenging |
– | Determining the most effective compiler optimizations | Training is based on random data sampling and requires a large number of samples, which may reduce its effectiveness |
– | Determining the best partitioning strategy | Assumes that for any two functions with similar features, the same partitioning strategy can be used |
[95] | Determining loops that benefit from parallelization and their best scheduling policy | Targets OpenMP loop constructs only. Uses profiling to detect loop candidates, which may significantly increase the compilation time |
– | Adaptive tuning of the compilation process | Profiling data needs to be collected to perform the virtual executions. Takes too long to find the optimal transformations. Efficient for simple models, but results in lower prediction accuracy for more complex problems |
– | Device-specific code generation; mapping applications to accelerators | The prediction model requires significant training data for accurate mapping decisions. Lack of training data may result in performance degradation |
[19] | Generating multi-threaded versions of the loop | The size of the executable file may dramatically increase for applications with large parallel code, and hardware architectures that consist of multiple multi-core processing units |
– | Source-to-source transformation | Limited to map-reduce operations. The automatic code generation is limited to specific features of Java code. Not all Java code can be translated to OpenCL |
[5] | Auto-tuning; run-time library | PetaBricks requires that developers write their applications in a programming language that is not widely known. The performance of PetaBricks is closely dependent on the architecture where the auto-tuning was performed |
5 Run-time
5.1 Scheduling
5.1.1 Static scheduling
References | Algorithm | Objectives | Features | On/Off-line |
---|---|---|---|---|
[98] | ANN; SVM | Mapping computations to multi-core CPUs; determine the optimal thread number; | Code features (# static instructions; # load/store operations; # branches); data and dynamic features (L1 data cache miss rate; branch miss rate) | Off-line (sup.) |
[40] | SVM | Mapping computations to the suitable processing device | Static code features (# int/float/math operations; barriers; memory accesses; % local/coalesced memory accesses; compute-memory ratio) | Off-line (sup.) |
[15] | ID3 DT | Mapping threads to specific cores; reduce memory latency and contention | Program features (transaction time ratio; transaction abort ratio; conflict detection policy; conflict resolution policy; cache misses) | Off-line (sup.) |
[71] | L; MP; IB1; IBk; KStar ... | Reducing the training data; select the most informative training data; mapping application to processors | – | Off-line (sup.) |
[64] | BDTR | Determine workload distribution of data-parallel applications on heterogeneous systems | Hardware configuration (# threads; # cores; # threads/core; thread affinity); application parameters (input size) | Off-line (sup.) |
[62] | SA | Determine near-optimal system configuration parameters of heterogeneous systems | System configuration parameters (# threads; thread affinity; workload fraction on host/device) | Off-line (sup.) |
– | BDTR; SA | Determine near-optimal system configuration on heterogeneous systems | Available resources; scheduling policy; workload fraction | Off-line (sup.) |
– | GA | Task scheduling | – | Off-line (sup.) |
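The use of a genetic algorithm for static task scheduling (last row of the table above) can be sketched as follows: each chromosome encodes a task-to-processor mapping, and the fitness function is the makespan on a heterogeneous system. Task costs and processor speeds are made up for illustration.

```python
# Illustrative sketch of a genetic algorithm for static task scheduling:
# evolve task-to-processor mappings that minimize the makespan.
# Task costs and processor speeds below are hypothetical.
import random

random.seed(42)

task_cost = [4, 8, 2, 6, 3, 7, 5, 1]      # abstract work units per task
proc_speed = [1.0, 2.0, 4.0]              # relative speed of each processor

def makespan(mapping):
    """Finish time of the busiest processor under a task->processor mapping."""
    load = [0.0] * len(proc_speed)
    for task, proc in enumerate(mapping):
        load[proc] += task_cost[task] / proc_speed[proc]
    return max(load)

def evolve(pop_size=30, generations=60, mutation_rate=0.1):
    n_tasks, n_procs = len(task_cost), len(proc_speed)
    pop = [[random.randrange(n_procs) for _ in range(n_tasks)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=makespan)
        survivors = pop[:pop_size // 2]          # elitist selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, n_tasks)   # one-point crossover
            child = a[:cut] + b[cut:]
            for i in range(n_tasks):             # per-gene mutation
                if random.random() < mutation_rate:
                    child[i] = random.randrange(n_procs)
            children.append(child)
        pop = survivors + children
    return min(pop, key=makespan)

best = evolve()
print(round(makespan(best), 2))
```

As the limitations tables later in this section note, such off-line GA schedulers typically assume task costs are known before execution and do not model task dependencies.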
5.1.2 Dynamic scheduling
References | Algorithm | Objectives | Features | On/Off-line |
---|---|---|---|---|
[30] | ANN | Determine the best number of threads | Static features (# load/store ops; # instructions; # branches); dynamic features (# processors; # workload threads; run queue length; ldavg-1; ldavg-5) | Off-line (sup.) |
[54] | R&F | Determine the application execution time in shared environments | Program input parameters; # processors; resource status | Off-line (sup.) |
[76] | SVM | Mapping tasks to processing devices | # tasks in the queue; the ready times of the machines; computing capabilities of each machine | Off-line (sup.) |
[60] | GrA | Evenly partitioning tasks between high performance clusters and the cloud | Estimated execution time determined by monitoring the actual exec. time of data or tasks chunks | On-line |
[61] | – | Predicting resource allocation for business processes in the Cloud | Runtime metrics of a process and its behavior | Off-line |
[16] | ID3 DT | Predicting a thread mapping strategy for STM applications | Transactional Time/Abort Ratio; Conflict Detection/Resolution Policy; Last-Level Cache Miss | Off-line (sup.) |
[43] | ANN | Improve the effectiveness of grid scheduler decisions | Characteristics of the tasks and machines | Off-line |
[44] | ANN; GA | Improve the makespan | Security demands, workload of task, and the output size | Off-line & On-line |
[36] | LR | Improving the scheduling algorithms using machine learning techniques | Job arrival time; required resources; # running jobs; occupied resources | On-line |
[11] | – | Optimize the task scheduling on heterogeneous platforms | Input data; data transfers; task performance; platform features | Off-line & On-line |
[41] | ANN | Predict the optimal number of threads | Program features and workload features | Off-line |
– | Adaptive LR | Determine the number of threads and scheduling policy for each parallel region | Inter-thread data locality; instruction mix; load imbalance | – |
– | GA | Minimize the makespan; dynamic task scheduling in heterogeneous systems | Task properties (arrival time; dependency); system properties (network; processors) | On-line |
[4] | Adaptive GrA | Mapping of computation kernels on heterogeneous GPU-accelerated systems | Profiling information (execution time; data-transfer time); hardware characteristics | Off-line |
[56] | HC | Selecting optimal per task system configuration for MapReduce applications | Map-reduce parameters (# mappers; # reducers; slow start; io.sort.mb; # virtual cores) | On-line |
[85] | PSO; SA | Dynamic scheduling of heterogeneous tasks on heterogeneous processors; load balancing | Task properties (execution time; communication cost; fitness function); hardware properties (# processors) | – |
[104] | GA | Dynamic load-balancing where optimal task scheduling can evolve at run-time | – | On-line |
[80] | – | Mapping tasks to heterogeneous architectures | Architectural trade-offs; computation patterns; application characteristics | On-line |
[24] | PR | Dynamic scheduling and performance optimization for heterogeneous systems | Kernel execution time; machine parameters; input size; input distribution variance; instrumentation data | Off-line (sup.) |
– | LR; QR | Prediction of performance aspects (e.g. execution time, power consumption) of implementation variants | System information (resource availability and requirements; estimated performance of implementation variants; input availability) | Off-line (sup.) |
[55] | DT | Reducing the number of training data required to build prediction models | Input parameters (e.g. size); system available resources (e.g. # cores; # accelerators) | Off-line (sup.) |
– | DT; DD; NB; SVM | Use meta-data from performance-aware components to predict the expected execution time; select the best implementation variant and the scheduling policy | Input parameters (e.g. size); system available resources; meta-data | Off-line (sup.) |
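To illustrate the on-line search techniques in the table (e.g. the hill-climbing-based per-task configuration tuning of [56]), the following sketch hill-climbs over a small two-knob configuration space. The cost function is a hypothetical stand-in for a measured task execution time, and the knob ranges are invented for illustration.

```python
# Illustrative sketch of hill climbing over a configuration space of two
# knobs (# mappers, # reducers), each in 1..16. In a real system,
# execution_time would be a measurement, not a closed-form model.

def execution_time(cfg):
    """Hypothetical cost model standing in for a measured execution time."""
    mappers, reducers = cfg
    return (mappers - 12) ** 2 + (reducers - 5) ** 2 + 10

def neighbors(cfg):
    """All in-range configurations one step away from cfg."""
    m, r = cfg
    return [(m + dm, r + dr)
            for dm, dr in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if 1 <= m + dm <= 16 and 1 <= r + dr <= 16]

def hill_climb(start, max_steps=100):
    current = start
    for _ in range(max_steps):
        best = min(neighbors(current), key=execution_time)
        if execution_time(best) >= execution_time(current):
            return current                   # local optimum reached
        current = best
    return current

print(hill_climb((1, 1)))  # converges to the cost model's optimum (12, 5)
```

Hill climbing needs no training data, but as with all local search it can get trapped in local optima when the real cost surface is not as well-behaved as this model.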
5.2 Adaptation
References | Method | Adaptation objectives | Monitored parameters | Tuned parameters |
---|---|---|---|---|
[91] | Custom adaptation loop; DT | Select the most suitable algorithm implementation | Architecture and system information (available memory, cache size, # processors); performance characteristics | Algorithm implementation |
– | ODA | Apply user-defined actions to change the program behavior | Performance information retrieved using application heartbeats [46] | User-defined actions (such as adjusting the clock speed) |
[28] | Lock Acquisition Scheduling; RL | Adapt the lock’s internal implementation | Reward signal (heart rate) retrieved using application heartbeats | Change the lock scheduling policy |
[29] | RL | Determine the ideal data structure knob settings | Reward signal (throughput heart rate); support for external perf. monitors | Adjusting the scancount value and performance-critical knob of Flat Combining algorithm |
[58] | LR | Adaptive mapping of computations to PE | Execution-time of parts of the program | Choosing the mapping scheme (static or adaptive) |
[84] | DSL | Adapt applications to meet user-defined goals | Contextual information, requirements, resources availability | User-defined actions (altering resource alloc. and task mapping)
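The adaptation approaches above share a monitor-decide-act structure: observe a performance signal (e.g. a heartbeat-style throughput measure), then adjust a performance-critical knob. A minimal sketch, with a hypothetical throughput model standing in for real measurements:

```python
# Illustrative sketch of a monitor-decide-act adaptation loop that tunes a
# performance-critical knob from a throughput ("heart rate") signal, in the
# spirit of the heartbeat-driven approaches above. The throughput model is
# hypothetical; a real system would measure it while the program runs.

def throughput(knob):
    """Hypothetical measured heart rate as a function of the knob value."""
    return 100 - abs(knob - 6) * 9

def adapt(knob, steps=20):
    direction = 1
    best_rate = throughput(knob)               # monitor the initial signal
    for _ in range(steps):
        candidate = knob + direction
        rate = throughput(candidate)           # monitor
        if rate > best_rate:                   # act: keep the improving move
            knob, best_rate = candidate, rate
        else:
            direction = -direction             # decide: reverse the search
    return knob

print(adapt(1))  # prints 6, the knob value with the highest throughput
```

As the limitations table below notes, such loops run the application with non-optimal settings until enough signal has been observed to converge.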
5.3 Observations, challenges and research directions
 | Method | Advantages |
---|---|---|
Machine Learning | Artificial Neural Network | Emani et al. [30] report speedups of up to \(3.2\times \) compared to the OpenMP default scheme, and \(2.3\times \) compared to the Hill Climbing on-line adaptation technique. Grzonka et al. [43] show that an ANN can be used to reduce the time required to find the best possible solutions by approximately 30–40%. Grewe et al. [41] show that their neural network is aware of existing workload and can reduce the slowdown of existing workload from 4.5 to 0.5%, at the cost of reducing the speedup from \(1.66\times \) to \(1.59\times \) |
 | Support Vector Machines | Grewe and O’Boyle [40] report achieving 80.6% of the optimal performance. Wang and O’Boyle [98] use ANN and SVM to determine the best number of threads and achieve up to 96% of the optimal performance. Kessler and Löwe [51] show that SVMs can be used to select the best optimization variant with 0% inaccuracy; however, the decision overhead is high |
 | Decision Trees | Castro et al. [15] show a performance improvement of up to 18.46% compared to the worst-case scenario. Memeti and Pllana [64] determine a near-optimal workload distribution on heterogeneous systems, resulting in a performance improvement of up to \(35.6\times \) compared to the sequential version. Thomas et al. [91] show that a prediction accuracy between 86 and 100% suffices to dynamically optimize the execution time by choosing the most suitable algorithm in a given context |
 | Regression | Gaussier et al. [36] predict the execution time, which helps achieve up to 28% makespan reduction. Zhang et al. [103] show a performance improvement of up to 27% when using regression techniques to predict the optimal number of threads and scheduling policy. Luk et al. [58] use regression techniques to map computations to processing units, resulting in performance improvements of up to 40% compared to CPU-always mapping and 25% compared to GPU-always mapping, and within 94% of the near-optimal mapping |
 | Reinforcement Learning | |
Meta-heuristics | Simulated Annealing | Memeti and Pllana [63] use simulated annealing to optimize the workload distribution on heterogeneous systems. By evaluating only about 5% of all possible configurations, it achieves average speedups of \(1.6\times \) and \(2\times \) compared with host-only and device-only execution |
 | Genetic Algorithms | |
 | Greedy Algorithm | Mantripragada et al. [60] predict the application execution time and dynamically shift part of the workload from the cluster to the cloud in order to meet the deadline. Albayrak [4] shows that nine out of ten times the mapping algorithm based on GrA performs better than single-device mapping |
 | Hill Climbing | Li et al. [56] show a performance improvement of up to 30% compared to the default configurations used by YARN |
 | Particle Swarm Optimization | Sivanandam et al. [85] use PSO and SA for task scheduling. The hybridization of these algorithms outperforms other algorithms, including GA |
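The simulated-annealing-based workload distribution described above can be sketched as follows: the search variable is the fraction of work offloaded to the device, and the cost is the resulting execution time. The cost model (host/device speeds and transfer overhead) is made up for illustration; a real system would measure it.

```python
# Illustrative sketch of simulated annealing searching for a near-optimal
# host/device workload fraction on a heterogeneous system.
import math
import random

random.seed(1)

def execution_time(fraction):
    """Hypothetical time when `fraction` of the work runs on the device."""
    host_time = (1 - fraction) * 10.0          # host processes its share
    device_time = fraction * 4.0 + 0.5         # device share + transfer cost
    return max(host_time, device_time)         # host and device overlap

def simulated_annealing(steps=500, temp=1.0, cooling=0.99):
    current = random.random()                  # initial workload fraction
    best = current
    for _ in range(steps):
        candidate = min(1.0, max(0.0, current + random.uniform(-0.1, 0.1)))
        delta = execution_time(candidate) - execution_time(current)
        # Accept improvements always; accept worse moves with a probability
        # that shrinks as the temperature decreases.
        if delta < 0 or random.random() < math.exp(-delta / temp):
            current = candidate
        if execution_time(current) < execution_time(best):
            best = current
        temp *= cooling
    return best

best = simulated_annealing()
print(round(execution_time(best), 2))
```

The key property exploited by the surveyed work is visible here: only a small sample of the configuration space is evaluated, yet the accepted-worse-moves mechanism lets the search escape poor regions early on.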
References | Focus | Limitations |
---|---|---|
– | Mapping computations to the most suitable processing units | The mapping process is dependent on the hardware architecture, which means that a mapping scheme that fits one architecture well may not yield the desired performance on another architecture. It requires re-learning the prediction model for each new application and architecture configuration |
– | Determining the optimal number of threads | |
[15] | Determining thread mapping strategy for TM applications | Since it uses static features, it cannot change the mapping strategy when the parallelism degree changes at runtime |
[56] | Determining near-optimal system configuration parameters | Requires extensive profiling data collection and analysis, which may introduce significant run-time overhead |
– | Dynamic task scheduling | The proposed approaches do not consider dependencies between tasks |
[80] | Dividing tasks into chunks and scheduling in a task-farm way | Task-farm or master-slave scheduling techniques require no profiling; however, they introduce communication overheads |
– | Task scheduling | Assume that communication time is known prior to execution, that processing units have equal computing power and are always ready to perform tasks, and that scheduling is determined off-line and cannot be changed at run-time |
[71] | Reducing the amount of required training data | This solution requires additional programming investment to build the prediction committee |
[4] | Mapping of computation to heterogeneous systems | Collecting profiling information for each kernel on every device represents a huge overhead, especially on heterogeneous systems that may comprise a large number of non-identical CPUs and/or GPUs |
[91] | Adaptive algorithm selection | Considers hardware characteristics, but not the program input characteristics. Some algorithms perform better for smaller input sizes, whereas others perform better with larger ones |
– | Self-adaptation | Requires adding additional information to the code. Requires running the application for a certain amount of time (often with non-optimal parameters) until the framework takes a more optimal decision |
6 Conclusion
 | Advantages | Limitations |
---|---|---|
Compile-time | \(+\) Source-to-source code generation and optimization may outperform manually written code | − Some approaches focus on single and simple compiler optimizations. Considering multiple and more complex compiler optimizations is more challenging |
 | \(+\) Selecting the best compiler optimizations results in better performance compared to hard-coded compiler optimizations | − Training requires a large amount of data or extensive profiling of the code. Lack of training data may result in performance degradation |
 | \(+\) In comparison to techniques that assume that each code block benefits from certain optimizations, performance improvement is observed for approaches that intelligently determine for each code block whether such optimizations result in better performance | − Some approaches generate multiple versions of the code and select the most suitable one at runtime. The code size may dramatically increase |
Run-time | \(+\) Co-scheduling can reduce the slowdown of other applications | − Some approaches consider only a few aspects of the system, such as code features, but not hardware characteristics, system utilization, or task dependencies |
 | \(+\) Selecting the best system configuration or algorithm implementation variant for a given execution context may result in better performance and reduce the time required for application tuning | − Some approaches assume that applications will be executed in isolation, whereas real-world applications may share resources with other applications |
 | \(+\) Adaptation approaches may change the program behavior during execution to meet certain user-defined goals | − Some approaches ignore slow processing elements, which may result in overall system underutilization |