Introduction
Basic Processor Architecture
Details of the Processor Stages
|
No pipelining |
With pipelining |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
There are two observations to be made here. First of all, instruction i takes 1 (long) clock cycle in the non-pipelined processor, and it takes 5 (short) clock cycles in the pipelined processor. Note that this "short" cycle is 1/5 of the "long" cycle, so it actually takes just as long to execute an instruction in the non-pipelined version as it does in the pipelined version.
The point is, however, that even though it takes just as long to execute one instruction, both processors finish the execution of an instruction after every clock cycle. Instruction i+1 finishes one clock cycle after instruction i for both processors. BUT: the clock cycle for the pipelined processor is a fifth of the clock cycle for the non-pipelined processor. Hence, the throughput, the number of instructions executed per unit time, is 5 times higher for the pipelined processor than it is for the non-pipelined processor. In other words, the pipelined processor is 5 times faster than the non-pipelined processor.
Try it out. To illustrate this point, watch the execution of a single instruction in the non-pipelined and in the pipelined processor. Notice that in the pipelined processor, the clock period is much shorter, and every stage now has its own set of registers. Also note that despite this, the instruction takes the same time to execute in either processor.
Try it out. Then consider the following simple program and watch the execution in both the non-pipelined and the pipelined processor and compare execution speed. You should notice that it takes 6 long cycles to execute the program in the non-pipelined processor, and 10 short cycles to execute the program in the pipelined processor.
addi r1, r0, 4 //
r1
:= 4
addi r2, r0, 5
// r2
:= 5
addi r3, r0, 6
// r3
:= 6
slli r1, r1, 2
// r1
:= r1
<< 2 (r1
:=
r1
* 4)
subi r2, r2, 3
// r2
:= r2
- 3
add r3, r3, r3 // r3
:=
r3
+ r3
Of course, it is not as simple as it seems (is anything ever?). The above program was carefully chosen for the pipelining to work. Consider the following code fragment:
add r1, r2, r3 //
r1
:=r2
+r3
sub r3, r1, r2
xor r3, r1, r2
srl r3, r1, r2
ori r3, r1, 10If we look at the pipeline during the executing of these instructions, we see:
Cycle IF ID EX MA WB 1 add 2 sub add 3 xor sub add 4 srl xor sub add 5 ori srl xor sub add 6 ori srl xor sub In cycle 3, the
sub
instruction reads the value of registerr0
. Theadd
instruction, however, whose destination register isr0
, has not reached the write-back phase yet. Hence, thesub
will read the wrong value forr0
!This situation is called a data hazard. There is another data hazard cycle 4 (which instructions?). The
srl
in cycle 5 is OK, because theadd
will write the result back to the register file in the first half of the clock cycle, and thesrl
will read its operands in the second half of the clock cycle (two phase memory access). It should be clear that theori
in cycle 6 is OK too - theadd
has finished execution completely in this cycle.The easiest solution is to ignore the problem. While that might sound like a bad solution, it is actually quite usable. Consider the above program again, and assume that
r1
is initially 1, andr2
is initially 2. Then:add r0, r1, r2 //
r0
:= 3
sub r2, r0, r1 //r2
:= -1 (uses old value ofr0
)
xor r2, r0, r1 //r2
:= 1 (uses old value ofr0
)
srl r2, r0, r1 //r2
:= 6 (uses new value ofr0
)
ori r2, r0, 10 //r2
:= 11 (uses new value ofr0
)It becomes a bit harder to read the source code, because the processor seems to execute instructions out of order. But as long as this behaviour is well-defined, the programmer (or the compiler) can take it into account and make good use of it. (Note that the processor does not really execute instructions out of order.)
Try it out. Follow the execution of this program in the MIPS pipeline.
A second solution is to stall the pipeline. To explain what a stall is, we need a bit of extra terminology. We can divide the pipeline into two parts, instruction fetch and instruction execute:
Instruction Fetch Instruction Execute IF ID EX MA WB When an instruction makes it from the first half of this pipeline to the second half, i.e. from the ID phase to the EX phase, it is said to be issued. On a stall, we do not issue the instruction in the ID phase. Instead, we issue a NOP ("do nothing") instruction. In the meantime, the IF and and ID phase will simply re-execute (we shall later see that it is is not quite as simple as that). The above program would be executed as follows:
Cycle IF ID EX MA WB 1 add 2 sub add 3 xor sub add We detect a stall here, and issue a nop
instead of thesub
4 xor sub nop add The add
has not yet reached WB, so we need another stalled cycle5 xor sub nop nop add The add
will write its results in the first half of the cycle, so thexor
will now get the correct value forr0
6 srl xor sub nop nop The downside is that with this solution, we need 7 cycles instead of 5 to execute our program. That's a 40% increase in execution time!
Try it out. Run this program in the animation. For clarity, the MIPS animation will issue a "
stall
" instruction instead of anop
instruction upon a stall. This is only to distinguishnop
s issued because of stalls fromnop
s that were actually in the instruction memory. In reality, thestall
instruction does not exist.
We want to be able to use the result of the ALU directly, without having to wait for the result to be written back to the register file. For this, we need to forward the result of the ALU directly back into the ALU.
Before the result gets written to the destination register, it is written into the intermediate registers OUT0 and OUT1 first (see diagram). We need to forward the contents of OUT0 and OUT1 back into the input of the ALU (connections marked 1a and 1b connect OUT0 to the ALU, and connections marked 2a and 2b connect OUT1 to the ALU). Of course, we then also need a multiplexer to select the right operand (MUX6 and MUX7).
With this solution, there are no more stalls, and the code is executed the way you would expect from reading the source code. If you don't understand it, try the animation - it should become clear.
Consider the following code fragment.
l r1, 1(r2) // Load the value at address 1+
r2
intor1
add r3, r1, r2How is this executed in the pipeline?
Cycle IF ID EX MA WB 1 lb 2 add lb 3 add lb 4 add lb 5 add lb Again, the
lb
will only write the value from memory into the register file in the WB phase, whereas theadd
retrieves its operands in the ID phase. Thus, theadd
will actually use an old value ofr1
. As before, we could ignore this problem and call it a feature, and that would be fine. We could also stall the pipeline for two cycles:
Cycle IF ID EX MA WB 1 lb 2 add lb 3 add lb We detect a stall, issue a nop
instead of theadd
4 add nop lb And another one 5 add nop nop lb The
lb
will write its result into OUT1 before it is written to the register file, however, so if we use pipeline forwarding, and we forward the result of OUT1 directly back to the ALU, we can reduce this to a one-cycle stall.
Cycle IF ID EX MA WB 1 lb 2 add lb 3 add lb We detect a stall, issue a nop
instead of theadd
4 add nop lb 5 add nop lb We cannot do better than this, simply because the value is not loaded from memory until the MA phase. Note that the load calculates the effective address of the memory operand in the ALU phase; in the above example, it would calculate 2+
r2
.Try it out. Consider each solution (ignoring the problem, interlock and interlock with forwarding) in the MIPS pipeline.
The store instruction exhibits the same data hazards as, for example, arithmetic instructions such as
add
orsub
. Consider the following program:add r1, r2, r3
s 10(r0), r1 memory[10] :=r1
To solve the hazard, we have the usual three options:
- Copy the result from B (effectively, from the register file) to SMR and ignore any dependencies. In effect, leave the hardware as it was.
- Stall the pipeline. This results in a two-cycle penalty
- Use forwarding. This requires an additional multiplexer (MUX8) and connections from OUT0 and OUT1 (connections marked 3 and 4)
Note that, conceptually, this is not a new problem (it is the exact same data hazard as we saw before). It is treated separately here, however, because we do need a bit of extra hardware to resolve the hardware for stores.
Consider the following code sequence:
add r1, r2, r3
j l1 // Jump to (label)l1
sub r2, r3, r0
l1: xor r3, r0, r1
ori r1, r2, 10How this is executed in the MIPS pipeline?
Cycle IF ID EX MA WB 1 add 2 j add 3 sub j add The branch is resolved here, but we already fetched the sub
4 xor sub j add 5 ori xor sub j add Notice that we executed the
sub
instruction which, logically, should never have executed because of the branch (j
). This happens because the branch target is only calculated in the ID phase. This problem is called a control hazard.As with load hazards, we can decide to ignore the control hazard. Once we define that the instruction directly after the branch is executed anyway, we can take that into account and make use of it. This technique is called delayed branches, and the instruction following the branch is said to be in the delay slot.
Try it out. Run this program using delayed branches.
Alternatively, we can stall the pipeline. Remember, when we stall the pipeline, the instruction in the ID phase (which would be the branch) is not issued (a
nop
is issued instead). We also said that the IF and ID phase simply re-execute, i.e., their registers are not updated.As it turns out, this is not completely true. If the PC would not be updated, the IF phase would fetch the same instruction again (the
sub
), and we would be in the exact same situation. Thus, the PC is updated for a control hazard. The ID register and PC1 are not updated, however, because if they did, thesub
instruction would make it to the ID phase, which is exactly what we want to avoid).Here is how the program would executed using pipeline interlock.
Cycle IF ID EX MA WB 1 add 2 j add 3 sub j add We detect a stall 4 xor j nop add PC is updated so when the IF re-executes, a different instruction is loaded 5 ori xor j nop add Try it out. Run this program using branch interlock.
Finally, there is a third option, branch prediction. Branch prediction uses a Branch Target Buffer to predict what the target of the PC will be. This branch target buffer or BTB is an associative cache. Each entry in this cache lists an address of a branch instruction (the tag), together with the prediction target. This predicted target is based on the last time the branch was executed.
When the IF phase executes, it sends the contents of the PC to the BTB (1). The BTB then compares this with the addresses (tags) in its cache. If the address is in the cache, we must have executed the branch before. We then predict that the branch will do the same as the last time, so we send the associated value in the BTB to the PC (2).
If the address is not there, either because we never saw the branch before (compulsory miss), or because we did see the the branch before, but too long ago and it has been replaced by another branch in the BTB (capacity miss or conflict miss), we cannot predict the branch and we simply calculate
PC+4
(as if we did not have branch prediction, this is also called predict-not-taken, because we "predict" that we do not take the branch and simply want to execute the next instruction in memory).Of course, the prediction could be wrong. We cannot determine this until the ID phase, however. Remember, in the IF phase, we do not know what kind of instruction we are executing - this is only determined in the ID phase. Therefore, in the ID phase, we calculate the branch target anyway. We then compare the calculated branch target with the predicted PC (by comparing it to the
PC
register). If the prediction is correct, all is well, but if it is wrong (or if we did not predict the target address because of a BTB miss), we need to stall the pipeline and update the BTB (connections 3, 4 and 5).So there are 3 different situations:
- The branch is not in the BTB
We update the PC with PC+4. When the branch gets to the ID phase, we stall the pipeline to update the BTB.- The branch is in the BTB, but the prediction is wrong
We update the PC with the predicted PC. When the branch gets to the ID phase, we stall the pipeline to update the BTB.- The branch is in the BTB and the prediction is correct
We update the PC with the predicted PC. When the branch gets to the ID phase, no action is taken.Try it out. See a demonstration of branch prediction. Notice that all three situations listed above occur when the program is executed.
The MIPS processor supports two conditional branches:beqz
(branch if equal to zero) andbnez
(branch if not equal to zero). Both these instructions take an arbitrary register as a source operand. This yields another data hazard:add r1, r2, r3
bneq l1, r1
sub r2, r3, r1
l1: xor r3, r2, r1
ori r3, r1, 10Consider the execution of this program in the pipeline:
Cycle IF ID EX MA WB 1 add 2 bneq add We could use a BTB here, or predict-not-taken 3 sub bneq add Now we need to check if our prediction was correct The problems start in cycle 3. We need to check register
r1
to see if it is not equal to zero. But theadd
instruction is still in the EX phase: the value forr1
is still being calculated! As for the other hazards that we have seen, we can ignore this problem, and call it a feature.We call also stall the pipeline, but we would have to stall it for two cycles (we would have to wait until the
add
gets to the WB phase). Even then, we would only have the result towards the end of the cycle (the register file is updated halfway through the cycle, and we still have to do the comparison against zero), but that is acceptable. That is quite a long stall, however, and this situation is very common (regard theadd
as a compare instruction and you'll understand why).Finally, we could try to forward the result, as we did for data hazards. That means, we would have to forward the contents of OUT0 (connection c) and OUT1 (d) to the comparator. But that is not enough! We also need to forward the result of the ALU as it is being calculated to the comparator (b). Remember, this result will only be written to OUT0 on the next clock cycle - which is too late. This adds an asynchronous (pronounce: dodgy) element to the element, but it is necessary to avoid a stall in this - very common - situation.
Try it out. See all three possible solutions (no zero forwarding, zero interlock and zero forwarding) in the animation.
The following table lists all possible stalls exhaustively, with the conditions for the stall. For each individual stall, you can run an animation that will show a program in which the stall occurs. Note that only for the control hazards, the PC register is updated during the stall. For all other stalls, the PC register is not updated (see section on control hazards for more information).
The sample programs for each of these hazards allow to change the circuit configuration and the contents of the instruction memory, if you wish. This allows you to try and remove the hazard by changing the program or the circuit configuration.
Hazard IF ID EX MA WB Condition Demo RAW i1 i2 .., Rx, .. i3 Rx, .., .. i4 i5 No ALU forwarding Run i1 i2 .., Rx, .. i3 i4 Rx, .., .. i5 No ALU forwarding Run RAW i1 sb .., Rx i3 Rx, .., .. i4 i5 No forwarding to SMR Run (store) i1 sb .., Rx i3 i4 Rx, .., .. i5 No forwarding to SMR Run RAW i1 beqz Rx i3 Rx, .., .. i4 i5 No zero-forwarding Run (branch) i1 beqz Rx i3 i4 Rx, .., .. i5 No zero-forwarding Run i1 jr Rx i3 Rx, .., .. i4 i5 (Always) Run i1 jr Rx i3 i4 Rx, .., .. i5 (Always) Run Load i1 i2 .., Rx, .. lb Rx, .. i4 i5 Load interlock Run Control i1 j L1 i3 i4 i5 Calculated PC ≠ pc
Run i1 jr Rx i3 i4 i5 Calculated PC ≠ pc
Run i1 beqz Rx, L1 i3 i4 i5 Calculated PC ≠ pc
Run
Question 1. Why is the MA phase idle for all instructions except loads and stores?
Question 2. In the section introduction of pipelining, the same program is executed in the non-piplined and in the pipelined processor. Confirm that the speedup is 300%, and explain why this is less than the expected 500%.
Question 3. Draw a pipeline diagram to show the hazard mentioned in the section on forwarding to the SMDR. Then write a program to show this hazard, and observe the effects of the different circuit configurations (Store operand forwarding, Store interlock, and No store interlock).
Question 4. The table with all pipeline stalls lists two situations that always cause a stall. What extra hardware would you need to avoid these stalls?
Question 5. The diagram shows four control lines for multiplexer MUX3 (see the section on conditional branches). Are all of these four lines strictly necessary?
Question 6. There are quite a number of stalls during the execution shift-and-add multiplier. Rewrite the program to avoid nearly all of these stalls.
[1] Computer Architecture: A Quantitative Approach, 3rd edition, John L. Hennessy & David A. Patterson, Morgan Kaufmann Publishers [now Elsevier] [2] Logic and Computer Design Fundamentals, 2nd edition, M. Morris Mano and Charles R. Kime, Prentice Hall