Introduction

Processor performance depends mainly on three factors:

- Processor cycle time.
- Number of instructions required to perform a certain task.
- Number of cycles required to complete an instruction.
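These three factors multiply together in the classic processor performance equation, execution time = instruction count × CPI × cycle time. A minimal sketch, with purely hypothetical numbers:

```python
# Classic CPU performance equation:
#   execution_time = instruction_count * CPI * cycle_time
# All numbers below are hypothetical, for illustration only.

def execution_time(instruction_count: int, cpi: float, cycle_time_s: float) -> float:
    """Return total execution time in seconds."""
    return instruction_count * cpi * cycle_time_s

# Example: 1e6 instructions, 4 cycles per instruction, 1 ns cycle time.
t = execution_time(1_000_000, 4.0, 1e-9)
print(f"{t * 1e3:.1f} ms")  # → 4.0 ms
```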
Related work
Proposed algorithm
- In-order execution: In this method, instructions are fetched, executed, and completed in a compiler-generated order. Instructions are scheduled statically, and if one instruction stalls, all subsequent instructions stall as well. In-order models such as the Alpha and Intel Atom processors have achieved strong performance; however, complex designs are required to combine peak capacity with increases in clock frequency.
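The stall propagation described above can be sketched with a toy single-issue model; the (name, dependencies, latency) instruction encoding and the three-instruction program are our own illustrative assumptions, not part of any simulator used in the paper:

```python
# Toy single-issue in-order model (a sketch, not a full pipeline simulator).
# An instruction may issue only after the one before it, and only once all
# of its producers have finished, so a single stall delays everything behind it.

def in_order_schedule(instrs):
    finish = {}      # producer name -> cycle its result becomes available
    issue = 0        # cycle the previous instruction issued
    schedule = []
    for name, deps, latency in instrs:
        ready = max((finish[d] for d in deps), default=0)
        issue = max(issue + 1, ready + 1)   # cannot pass the instruction ahead
        finish[name] = issue + latency - 1
        schedule.append((name, issue))
    return schedule

# Hypothetical program: i2 depends on the long-latency i1, so the
# independent i3 is nevertheless stalled behind i2.
program = [("i1", [], 3), ("i2", ["i1"], 1), ("i3", [], 1)]
print(in_order_schedule(program))
```

Here i3 issues only at cycle 5, even though its operands were ready at cycle 1.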
- Out-of-order execution: In in-order execution, data dependencies and functional-unit latencies can reduce processor performance. To overcome this, we use out-of-order (OOO) execution, the traditional way to increase the efficiency of pipelined processors by maximizing the number of instructions issued per cycle [24]. However, this technique is very costly to implement. Most high-performance processors (such as those from DEC and HP) execute instructions out of order. In this method, instructions are fetched in a compiler-generated order, but an instruction that does not depend on the current instruction may enter the pipeline ahead of it; that is, independent instructions are executed in a different order. Instructions are scheduled dynamically, and they may complete either in order or out of order.
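A toy sketch of out-of-order issue, assuming unlimited functional units and issue width; the (name, dependencies, latency) instruction encoding is an illustrative assumption:

```python
# Sketch of out-of-order issue: an instruction issues as soon as its
# operands are ready, regardless of program order. Assumes unlimited
# functional units and issue width; instructions are hypothetical
# (name, dependencies, latency) tuples.

def ooo_schedule(instrs):
    finish = {}      # producer name -> cycle its result becomes available
    schedule = []
    for name, deps, latency in instrs:
        ready = max((finish[d] for d in deps), default=0)
        issue = ready + 1                 # issue as soon as operands are ready
        finish[name] = issue + latency - 1
        schedule.append((name, issue))
    return schedule

program = [("i1", [], 3), ("i2", ["i1"], 1), ("i3", [], 1)]
print(ooo_schedule(program))  # i3 no longer waits behind i2
```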
- Tomasulo’s algorithm: As a dynamic scheduling algorithm, it uses a hardware-based mechanism to remove stalls at run time. It allows sequential instructions that would normally be stalled by dependencies to execute non-sequentially (out-of-order execution). It uses register renaming to resolve Write-after-Write (WAW) and Write-after-Read (WAR) hazards, which allows the continual issuing of instructions; Read-after-Write (RAW) hazards are handled by making an instruction wait in a reservation station until its operands are available. The algorithm also uses a common data bus (CDB) on which computed values are broadcast to all the reservation stations that may need them. This allows improved parallel execution of instructions that would otherwise stall. Tomasulo’s algorithm was chosen for comparison because its order of instruction execution is nearly equivalent to that of our proposed algorithm; both algorithms reorder instructions at the micro-architecture level.
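Register renaming, the mechanism that lets Tomasulo’s algorithm discard WAW and WAR hazards, can be sketched in a few lines; the (dest, src1, src2) instruction encoding over architectural registers is an illustrative assumption:

```python
from itertools import count

def rename(instrs):
    phys = count()        # generator of fresh physical register ids
    mapping = {}          # architectural register -> current physical register
    renamed = []
    for dest, *srcs in instrs:
        new_srcs = []
        for s in srcs:
            if s not in mapping:          # first read: model an initial value
                mapping[s] = next(phys)
            new_srcs.append(mapping[s])   # true RAW dependences are preserved
        mapping[dest] = next(phys)        # fresh register per write: no WAW/WAR
        renamed.append((mapping[dest], *new_srcs))
    return renamed

# r1 is written twice (WAW) and read in between (WAR); after renaming the
# two writes target distinct physical registers and can proceed independently.
prog = [("r1", "r2", "r3"), ("r4", "r1", "r2"), ("r1", "r5", "r6")]
print(rename(prog))
```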
- Proposed algorithm (LR (Left-Right)): The methods and algorithms above each have merits and demerits when executing instructions in a pipelined processor. Instead of using other methods to reduce power consumption, we propose an algorithm that performs stall reduction in a Left-Right (LR) manner over sequential instruction execution, as shown in Figure 1. Our algorithm introduces a hybrid order of instruction execution to reduce power dissipation. More precisely, it executes instructions serially, as in in-order execution, until a stall condition is encountered; thereafter, it uses the concept of out-of-order execution to replace the stall with an independent instruction. Thus, LR increases throughput by executing independent instructions while lengthy instructions are still executing in other functional units or while registers are involved in an ongoing operation. LR also prevents the hazards that might occur during instruction execution. The instructions are scheduled statically at compile time, as shown in Figure 2. In our proposed approach, given a buffer that can hold a certain number of sequential instructions, the algorithm generates a sequence in which the instructions should be executed so as to reduce the number of stalls while maximizing processor throughput. It is assumed that all instructions are in op-code source destination format.
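The behaviour described above (proceed in program order until a stall, then fill the slot with the first independent instruction found further right in the buffer) can be sketched as a toy single-issue model; the (name, dependencies, latency) encoding is our assumption, not the paper's implementation:

```python
# Toy single-issue sketch of the LR idea: each cycle, scan the buffer left
# to right and issue the oldest instruction whose operands are ready, so
# execution stays in order until a stall, and a stall slot is filled by the
# first independent instruction further right. Structural hazards ignored.

def lr_schedule(instrs):
    finish = {}                   # name -> cycle result becomes available
    buffer = list(instrs)         # instruction buffer, in program order
    cycle = 0
    schedule = []
    while buffer:
        cycle += 1
        for i, (name, deps, latency) in enumerate(buffer):
            if all(d in finish and finish[d] < cycle for d in deps):
                finish[name] = cycle + latency - 1
                schedule.append((name, cycle))
                buffer.pop(i)
                break
        # If nothing is ready this cycle, it remains a genuine bubble.
    return schedule

# Hypothetical program: i2 waits on the long-latency i1, so LR fills the
# stall slot at cycle 2 with the independent i3.
program = [("i1", [], 3), ("i2", ["i1"], 1), ("i3", [], 1)]
print(lr_schedule(program))
```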
Comparison of LR vs. Tomasulo algorithm
Simulation and power-performance evaluation
| In-order execution | Tomasulo’s algorithm | Proposed algorithm (LR) |
|---|---|---|
| Static scheduling | Hardware dynamic scheduling | Static scheduling |
| The compiler tries to reorder the instructions at compile time to reduce pipeline stalls | The hardware’s dynamic scheduling tries to rearrange the instructions at run time to reduce pipeline stalls | Instruction execution order is decided at compile time |
| Uses less hardware | More hardware units added | Uses more powerful algorithmic techniques (sorting) |
| Sequential order | Register renaming is used to reduce stalls | Sorting takes place first, then execution of an instruction |
| Bottom-up approach | Re-ordering of CPU instructions | Hybrid order of in-order and OOO |
| For example: char x; (read x: starts on cycle 1, completes on cycle 2); int a = 10 + 20; (assignment to a: starts on cycle 3, completes on cycle 4); print char x; (starts on cycle 5, completes on cycle 6) | char x; (starts on cycle 1, completes on cycle 2); int a = 10 + 20; (starts on cycle 2, completes on cycle 3); print char x; (starts on cycle 3, completes on cycle 4). Due to the added hardware units, more power dissipation | char x; (starts on cycle 1, completes on cycle 2); int a = 10 + 20; (starts on cycle 2, completes on cycle 3); print char x; (starts on cycle 3, completes on cycle 4) |
Performance evaluation
| Metrics | LR | In-order | Tomasulo |
|---|---|---|---|
| Instructions per cycle (IPC) | 0.2462 | 0.2442 | 0.7327 |
| Clocks per instruction (CPI) | 4.061 | 4.09 | 1.3648 |
| Simulation speed (inst/sec) | 785 | 606 | 98021.9 |
| Total number of instructions executed | 40516 | 41004 | 2350639 |
| Time/Program (cycles) | 164535.476 | 167706.36 | 3207982.549 |
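As a sanity check on the table, CPI should be roughly the reciprocal of IPC, and total instructions times CPI should reproduce the Time/Program entry (which indicates Time/Program is measured in cycles). Checking the LR column:

```python
# Consistency check on the LR column of the performance table:
# CPI ≈ 1 / IPC, and instructions * CPI reproduces Time/Program,
# so Time/Program is expressed in cycles.
ipc, cpi, instructions, time_per_program = 0.2462, 4.061, 40516, 164535.476

assert abs(1 / ipc - cpi) < 0.01           # 1 / 0.2462 ≈ 4.0617
assert abs(instructions * cpi - time_per_program) < 1e-6
print("LR column is internally consistent")
```

The same relation holds for the In-order column (41004 × 4.09 = 167706.36); the Tomasulo column matches to within the rounding of its CPI value.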
Power consumption evaluation
| Component | Simulation parameters | LR | In-order | Tomasulo |
|---|---|---|---|---|
| ALU | alu avg power # avg power for alu | 0.0001 | 0.0001 | 0.0003 |
| dl1 | dl1.avgswitching # dl1 avg switching power dissipation | 0.0060 | 0.0079 | 0.0172 |
| | dl1.avginternal # dl1 avg internal power dissipation | 0.0192 | 0.0199 | 0.0431 |
| | dl1.avgleakage # dl1 avg leakage power dissipation | 0.0029 | 0.0029 | 0.0029 |
| | dl1.avgpdissipation # dl1 avg power dissipation | 0.223 | 0.2208 | 0.2459 |
| il1 | il1.avgswitching # il1 avg switching power dissipation | 0.0343 | 0.0338 | 0.0950 |
| | il1.avginternal # il1 avg internal power dissipation | 0.0861 | 0.0849 | 0.2384 |
| | il1.avgleakage # il1 avg leakage power dissipation | 0.0029 | 0.0029 | 0.0029 |
| | il1.avgpdissipation # il1 avg power dissipation | 0.4274 | 0.4292 | 0.4525 |
| irf | irf.avgswitching # irf avg switching power dissipation | 0.0063 | 0.0066 | 0.0162 |
| | irf.avginternal # irf avg internal power dissipation | 0.0085 | 0.0089 | 0.0218 |
| | irf.avgleakage # irf avg leakage power dissipation | 0.0001 | 0.0001 | 0.0001 |
| | irf.avgpdissipation # irf avg power dissipation | 0.041 | 0.0414 | 0.0407 |
| Clock | clock.avgleakage # clock avg leakage power dissipation | 191.13 | 191.13 | 191.13 |
| Component | LR algorithm | In-order algorithm | Tomasulo | (%) Improvement (Tomasulo) |
|---|---|---|---|---|
| ALU | 0.0001 | 0.0001 | 0.0003 | 33.3 |
| il1 | 0.5507 | 0.5508 | 0.3091 | 21.1 |
| dl1 | 0.2527 | 0.2515 | 0.7888 | 33.5 |
| irf | 0.0559 | 0.057 | 0.0788 | 33.5 |