Vivio - A Tool For Creating Interactive Reversible E-Learning Animations for the WWW
© 2003 - 2005 University of Dublin
|IF||The Instruction Fetch stage fetches the next instruction from memory using the address in the PC register.|
|ID||The Instruction Decode stage decodes the instruction in the instruction register, calculates the next PC and fetches the operands from the register file.|
|EX||The Execute stage "executes" the instruction. In fact, all ALU operations are done in this stage. (The ALU is the Arithmetic and Logic Unit and performs operations such as addition, subtraction, shifts left and right, etc.)|
|MA||The Memory Access stage performs any memory access required by the current instruction. So, for loads, it would load an operand from memory. For stores, it would store an operand into memory. For all other instructions, it would do nothing.|
For instructions that have a result (a destination register), the Write Back stage writes this result back to the register file. Note that this includes nearly all instructions, except stores and no-ops.
Consider the execution of the following program and try to work out the operation of each of the 5 processor stages (IF, ID, EX, MA and WB) for each individual instruction. (Note that register r0 in the MIPS processor is always 0.)
l r1, 0(r0) // r1 := memory[0]
l r2, 1(r0) // r2 := memory[1]
add r3, r1, r2 // r3 := r1 + r2
s 2(r0), r3 // memory[2] := r3
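As a cross-check, the net effect of this four-instruction program can be sketched in a few lines of Python. This is a hypothetical behavioural model, not the Vivio animation itself; the initial memory contents (7 and 5) are arbitrary values chosen for illustration.

```python
# Hypothetical behavioural sketch of the example program.
# Initial memory values (7 and 5) are arbitrary, chosen for illustration.
mem = {0: 7, 1: 5, 2: 0}
reg = {f"r{i}": 0 for i in range(32)}   # register file; r0 reads as 0

def run(instr):
    op, *args = instr
    if op == "l":                        # load: rd := memory[offset + base]
        rd, off, base = args
        reg[rd] = mem[off + reg[base]]   # the MA stage does the memory read
    elif op == "add":                    # add: rd := rs + rt (done in EX)
        rd, rs, rt = args
        reg[rd] = reg[rs] + reg[rt]
    elif op == "s":                      # store: memory[offset + base] := rs
        off, base, rs = args
        mem[off + reg[base]] = reg[rs]   # the MA stage does the memory write
    reg["r0"] = 0                        # r0 is always 0

program = [
    ("l", "r1", 0, "r0"),       # r1 := memory[0]
    ("l", "r2", 1, "r0"),       # r2 := memory[1]
    ("add", "r3", "r1", "r2"),  # r3 := r1 + r2
    ("s", 2, "r0", "r3"),       # memory[2] := r3
]
for instr in program:
    run(instr)

print(mem[2])   # 12
```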
Try it out. Follow the execution of this program in the MIPS animation and check if you were right. If you don't understand each and every bit of the diagram yet, don't despair - each stage will be explained separately and in detail in the next section.
As I said before, the Instruction Fetch phase fetches the next instruction. First, it sends the contents of the PC register, which contains the address of the next instruction, to the instruction memory (1). The instruction memory then responds by sending the corresponding instruction. This instruction is passed on to the next (instruction decode) phase (2).
The instruction decode phase calculates the next PC and sends it back to the IF phase (4) so that the IF phase knows which instruction to fetch next. To be able to execute a jr instruction (which changes the PC to the contents of a register), we also need a connection from the register file to the PC (5).
One of these (4 or 5) is then selected (MUX1) to update the PC on the next rising clock edge. The control line for MUX1 (not shown on the diagram) would also come from the ID phase, and is based on instruction type.
The Instruction Decode phase has two main tasks: calculating the next PC and fetching the operands for the current instruction. There are three possibilities for the next PC.
- For the most common case, all "normal" instructions (those that are not branches or jumps), we simply calculate PC+4 to get the address of the next instruction (3, ADD4).
- For jumps and branches (if the branch is taken), we might also have to add some immediate value (the branch offset) to the PC (ADDi). This branch offset is encoded in the instruction itself (6).
- For the jr and jalr instructions, we need to use the value of a register instead.
MUX3 selects between the first two possibilities (+4 or +immediate), and MUX1 (in the IF phase) selects between MUX3 and the third possibility (the value from the register file). The control line for MUX3 comes from a zero-detector (not shown), which in turn gets its input from the register file (thin line marked "a"). The control line for MUX1 is not shown and is based on the instruction type.
The second main task is to fetch the operands for the current instruction. These operands come from the register file for all but two instructions (7, 8). The two exceptions are jal and jalr. These two jump-and-link instructions save the address of the next instruction in a destination register, so instead of sending an operand from the register file, we need to send the contents of PC+4 (10).
MUX5 selects between (7) and (10). Again, the control line for this MUX (not shown) comes from the ID phase and is based on the instruction type.
The Execution phase "executes" the instruction. Any calculations necessary are done in this phase, all of them by the ALU (Arithmetic and Logic Unit). The internals of this unit are beyond the scope of this tutorial, but you can find an explanation in any elementary book on digital logic.
Conceptually this stage is very simple. The ALU needs two operands. These either come from the ID phase (8, 9) and thus in turn from the register file (or PC+4 for jump and link instructions), or one operand comes from the ID phase (8) and the other comes from the instruction register (6) to supply an immediate value.
MUX7 selects between (9) and (6). The control line for this MUX is not shown, but comes from the ID phase and is based on the instruction type (it selects (9) for instructions that operate on two registers, and (6) for instructions that operate on one register and one immediate value).
(11) is used for the s (store) instruction. Consider the following instruction:
s 10(r0), r2 // memory[10+r0] := r2
The calculation 10+r0 will be done by the ALU (so which source does MUX7 select?), and the result of this calculation will be passed on to the MA stage (12). In the MA phase, however, we will also need to know the contents of r2. This is why we need (11).
The purpose of the Memory Access phase is to store operands into memory or load operands from memory. So, for all instructions except loads and stores, the MA simply passes the value from the ALU on to the WB stage (12, 16).
For loads and stores, the MA needs to send the effective address calculated by the ALU to memory (13). For loads, the memory will then respond by sending the requested data back (15). For stores, we need to send the data to be stored along with the effective address (11, 14), and the memory will respond by updating itself.
Finally, MUX9 selects between (12) and (15) as the input for (16). As usual, this decision is based on the instruction type and the control line comes from the ID phase (not shown).
The Write Back phase, finally, is a very simple one. It takes the output of the MA phase (16) and writes it back to the register file to store the result into the destination register.
For stores (and no-ops), the write-back does nothing.
If you have watched the execution of your programs carefully, you will have noticed that only one stage is active at any given moment. First, the IF stage fetches the instruction. Then, the ID phase decodes the instruction and fetches the operands, while the IF phase is idle. Then the EX phase executes the instruction while both IF and ID are idle. And so forth. Also, the clock period is really long: it must be long enough to allow an instruction to propagate through all 5 stages.
An obvious method to speed up the processor is to make the individual pipeline stages work in parallel. So, when the ID phase is decoding an instruction i, the IF phase would be fetching the next instruction i + 1 in parallel. On the next clock cycle, the EX phase would execute instruction i, the ID phase would decode instruction i + 1 and the IF phase would fetch instruction i + 2. This technique is called pipelining.
For this to work, each phase needs its own set of registers to hold the instruction and the instruction operands. This is necessary because each stage will be working on a different instruction. But inserting registers between the different phases has a huge advantage: the clock period can now be much shorter. Indeed, we only need to allow for an instruction to make it through one stage of the processor, instead of all 5. So, at least in theory, we can clock the processor 5 times faster. Shown visually:
Non-pipelined:
Cycle  IF    ID    EX    MA    WB
  1    i     i     i     i     i
  2    i+1   i+1   i+1   i+1   i+1
Pipelined:
Cycle  IF    ID    EX    MA    WB
  1    i
  2    i+1   i
  3    i+2   i+1   i
  4    i+3   i+2   i+1   i
  5    i+4   i+3   i+2   i+1   i
  6    i+5   i+4   i+3   i+2   i+1
  7    i+6   i+5   i+4   i+3   i+2
  8    i+7   i+6   i+5   i+4   i+3
  9    i+8   i+7   i+6   i+5   i+4
 10    i+9   i+8   i+7   i+6   i+5
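The pipelined table above follows a simple rule: in cycle c, the stage at depth s holds instruction i = c - s (counting cycles from 1 and stages from 0). A small Python sketch of this rule, purely for illustration:

```python
# Sketch of how the pipelined occupancy table is generated: in cycle c,
# stage s (0 = IF .. 4 = WB) holds instruction index c - 1 - s, if it exists.
STAGES = ["IF", "ID", "EX", "MA", "WB"]

def pipeline_diagram(n_instructions, n_cycles):
    rows = []
    for cycle in range(1, n_cycles + 1):
        row = {}
        for s, stage in enumerate(STAGES):
            i = cycle - 1 - s                    # instruction index in this stage
            if 0 <= i < n_instructions:
                row[stage] = f"i+{i}" if i else "i"
        rows.append(row)
    return rows

diagram = pipeline_diagram(n_instructions=10, n_cycles=10)
print(diagram[0])   # cycle 1: only IF is busy, holding instruction i
print(diagram[4])   # cycle 5: all five stages busy, WB holds instruction i
```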
There are two observations to be made here. First of all, instruction i takes 1 (long) clock cycle in the non-pipelined processor, and it takes 5 (short) clock cycles in the pipelined processor. Note that this "short" cycle is 1/5 of the "long" cycle, so it actually takes just as long to execute an instruction in the non-pipelined version as it does in the pipelined version.
The point is, however, that even though it takes just as long to execute one instruction, both processors finish the execution of an instruction after every clock cycle. Instruction i+1 finishes one clock cycle after instruction i for both processors. BUT: the clock cycle for the pipelined processor is a fifth of the clock cycle for the non-pipelined processor. Hence, the throughput, the number of instructions executed per unit time, is 5 times higher for the pipelined processor than it is for the non-pipelined processor. In other words, the pipelined processor is 5 times faster than the non-pipelined processor.
Try it out. To illustrate this point, watch the execution of a single instruction in the non-pipelined and in the pipelined processor. Notice that in the pipelined processor, the clock period is much shorter, and every stage now has its own set of registers. Also note that despite this, the instruction takes the same time to execute in either processor.
Try it out. Then consider the following simple program and watch the execution in both the non-pipelined and the pipelined processor and compare execution speed. You should notice that it takes 6 long cycles to execute the program in the non-pipelined processor, and 10 short cycles to execute the program in the pipelined processor.
addi r1, r0, 4 // r1 := 4
addi r2, r0, 5 // r2 := 5
addi r3, r0, 6 // r3 := 6
slli r1, r1, 2 // r1 := r1 << 2 (r1 := 16)
subi r2, r2, 3 // r2 := r2 - 3 (r2 := 2)
add r3, r3, r3 // r3 := r3 + r3 (r3 := 12)
Of course, it is not as simple as it seems (is anything ever?). The above program was carefully chosen for the pipelining to work. Consider the following code fragment:
add r1, r2, r3 // r1 := r2 + r3
sub r3, r1, r2
xor r3, r1, r2
srl r3, r1, r2
ori r3, r1, 10
If we look at the pipeline during the execution of these instructions, we see:
Cycle  IF    ID    EX    MA    WB
  1    add
  2    sub   add
  3    xor   sub   add
  4    srl   xor   sub   add
  5    ori   srl   xor   sub   add
  6          ori   srl   xor   sub
In cycle 3, the sub instruction reads the value of register r1. The add instruction, however, whose destination register is r1, has not reached the write-back phase yet. Hence, the sub will read the wrong (old) value of r1.
This situation is called a data hazard. There is another data hazard in cycle 4 (which instructions?). The srl in cycle 5 is OK, because the add will write its result back to the register file in the first half of the clock cycle, and the srl will read its operands in the second half of the clock cycle (two-phase register file access). It should be clear that the ori in cycle 6 is OK too - by this cycle, the add has completely finished execution.
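The hazard distances discussed here can be sketched in Python: with no forwarding and a two-phase register file, only a reader issued 1 or 2 instruction slots after the writer is hazardous, while a gap of 3 or more is already safe. This checker is a hypothetical illustration; the (destination, sources) encoding of instructions is an assumption, not the animation's format.

```python
# Hypothetical RAW-hazard checker for the 5-stage pipeline, with no
# forwarding. Thanks to the two-phase register file, a reader 3 slots
# after the writer is safe, so only distances 1 and 2 are hazards.
def raw_hazards(program):
    """program: list of (dest, [sources]); returns (writer, reader) index pairs."""
    hazards = []
    for w, (dest, _) in enumerate(program):
        for r in range(w + 1, min(w + 3, len(program))):
            if dest in program[r][1]:
                hazards.append((w, r))
    return hazards

program = [
    ("r1", ["r2", "r3"]),  # add r1, r2, r3
    ("r3", ["r1", "r2"]),  # sub r3, r1, r2  <- reads r1 one slot after the add
    ("r3", ["r1", "r2"]),  # xor r3, r1, r2  <- reads r1 two slots after the add
    ("r3", ["r1", "r2"]),  # srl r3, r1, r2  (safe: two-phase register file)
    ("r3", ["r1"]),        # ori r3, r1, 10  (safe)
]
print(raw_hazards(program))  # [(0, 1), (0, 2)]
```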
The easiest solution is to ignore the problem. While that might sound like a bad solution, it is actually quite usable. Consider the following program, and assume that r1 is initially 1 and r2 is initially 2. Then:
add r0, r1, r2 // r0 := 3
sub r2, r0, r1 // r2 := -1 (uses the old value of r0)
xor r2, r0, r1 // r2 := 1 (uses the old value of r0)
srl r2, r0, r1 // r2 := 6 (uses the new value of r0)
ori r2, r0, 10 // r2 := 11 (uses the new value of r0)
It becomes a bit harder to read the source code, because the processor seems to execute instructions out of order. But as long as this behaviour is well-defined, the programmer (or the compiler) can take it into account and make good use of it. (Note that the processor does not really execute instructions out of order.)
Try it out. Follow the execution of this program in the MIPS pipeline.
A second solution is to stall the pipeline. To explain what a stall is, we need a bit of extra terminology. We can divide the pipeline into two parts, instruction fetch and instruction execute:
Instruction Fetch    Instruction Execute
IF    ID             EX    MA    WB
When an instruction makes it from the first half of this pipeline to the second half, i.e. from the ID phase to the EX phase, it is said to be issued. On a stall, we do not issue the instruction in the ID phase. Instead, we issue a NOP ("do nothing") instruction. In the meantime, the IF and ID phases simply re-execute (we shall see later that it is not quite as simple as that). The above program would be executed as follows:
Cycle  IF    ID    EX    MA    WB
  1    add
  2    sub   add
  3    xor   sub   add               We detect a stall here, and issue a nop instead of the sub
  4    xor   sub   nop   add         The add has not yet reached WB, so we need another stalled cycle
  5    xor   sub   nop   nop   add   The add writes its result in the first half of the cycle, so the sub now reads the correct value for r1
  6    srl   xor   sub   nop   nop
The downside is that with this solution, we need 7 cycles instead of 5 to execute our program. That's a 40% increase in execution time!
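The 7-versus-5 cycle count can be reproduced with a small sketch, under the assumptions in the text: a reader issued 1 slot after the writer stalls 2 cycles, a reader 2 slots after stalls 1 cycle, and a gap of 3 is safe thanks to the two-phase register file. The function name and the (destination, sources) program encoding are illustrative.

```python
# Sketch of the issue-cycle count with stalling and no forwarding:
# a reader must issue at least 3 cycles after the writer.
def issue_cycles(program):
    """program: list of (dest, [sources]); returns total cycles spent issuing."""
    cycles = 0
    last_write = {}                      # register -> issue cycle of its writer
    for dest, sources in program:
        cycles += 1                      # the cycle in which we try to issue
        stall = 0
        for src in sources:
            if src in last_write:
                gap = cycles - last_write[src]
                if gap < 3:
                    stall = max(stall, 3 - gap)
        cycles += stall                  # nops issued while we wait
        last_write[dest] = cycles
    return cycles

program = [
    ("r1", ["r2", "r3"]),  # add r1, r2, r3
    ("r3", ["r1", "r2"]),  # sub r3, r1, r2
    ("r3", ["r1", "r2"]),  # xor r3, r1, r2
    ("r3", ["r1", "r2"]),  # srl r3, r1, r2
    ("r3", ["r1"]),        # ori r3, r1, 10
]
print(issue_cycles(program))  # 7: five issue cycles plus two stall cycles
```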
Try it out. Run this program in the animation. For clarity, the MIPS animation will issue a "stall" instruction instead of a nop instruction upon a stall. This is only to distinguish nops issued because of stalls from nops that were actually in the instruction memory. In reality, the stall instruction does not exist.
We want to be able to use the result of the ALU directly, without having to wait for the result to be written back to the register file. For this, we need to forward the result of the ALU directly back into the ALU.
Before the result gets written to the destination register, it is written into the intermediate registers OUT0 and OUT1 first (see diagram). We need to forward the contents of OUT0 and OUT1 back into the input of the ALU (connections marked 1a and 1b connect OUT0 to the ALU, and connections marked 2a and 2b connect OUT1 to the ALU). Of course, we then also need a multiplexer to select the right operand (MUX6 and MUX7).
With this solution, there are no more stalls, and the code is executed the way you would expect from reading the source code. If you don't understand it, try the animation - it should become clear.
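The selection made by a forwarding multiplexer such as MUX6 can be sketched as a priority choice: prefer the youngest in-flight result whose destination register matches the source register, and fall back to the register file otherwise. This is a hypothetical sketch; the function and its arguments are illustrative, and OUT0/OUT1 are assumed to hold the results of the instructions currently in MA and WB respectively.

```python
# Hypothetical sketch of the forwarding logic in front of one ALU input
# (e.g. MUX6). OUT0 holds the result of the instruction currently in MA,
# OUT1 the result of the one in WB; the youngest matching result wins.
def forward_operand(src_reg, regfile_value, ma_dest, out0, wb_dest, out1):
    if src_reg == ma_dest:
        return out0          # 1a/1b: result computed one cycle ago
    if src_reg == wb_dest:
        return out1          # 2a/2b: result computed two cycles ago
    return regfile_value     # no hazard: use the register file

# add r1, r2, r3 followed immediately by sub r4, r1, r2:
# when the sub is in EX, the add is in MA and its result sits in OUT0.
print(forward_operand("r1", 0, "r1", 42, None, None))  # 42, not the stale 0
```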
Consider the following code fragment.
lb r1, 1(r2) // Load the value at address 1+r2 into r1
add r3, r1, r2
How is this executed in the pipeline?
Cycle  IF    ID    EX    MA    WB
  1    lb
  2    add   lb
  3          add   lb
  4                add   lb
  5                      add   lb
The lb will only write the value from memory into the register file in the WB phase, whereas the add fetches its operands in the ID phase. Thus, the add will actually use an old value of r1. As before, we could ignore this problem and call it a feature, and that would be fine. We could also stall the pipeline for two cycles:
Cycle  IF    ID    EX    MA    WB
  1    lb
  2    add   lb
  3          add   lb                We detect a stall, issue a nop instead of the add
  4          add   nop   lb          And another one
  5          add   nop   nop   lb
The lb will write its result into OUT1 before it is written to the register file, however, so if we use pipeline forwarding and forward the contents of OUT1 directly back to the ALU, we can reduce this to a one-cycle stall.
Cycle  IF    ID    EX    MA    WB
  1    lb
  2    add   lb
  3          add   lb                We detect a stall, issue a nop instead of the add
  4          add   nop   lb
  5                add   nop   lb
We cannot do better than this, simply because the value is not loaded from memory until the MA phase. Note that the load calculates the effective address of the memory operand in the EX phase (by the ALU); in the above example, it would calculate 1+r2.
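The resulting load interlock penalties can be summarised in a tiny sketch, under the assumptions above: with forwarding from OUT1, the loaded value is available one cycle after the MA phase; without forwarding, the reader must wait for the WB phase. The function is illustrative, not part of the animation.

```python
# Sketch of the load interlock penalty. With forwarding from OUT1, a load
# followed immediately by a use of its result costs 1 stall cycle; without
# forwarding it costs 2; each intervening instruction hides one stall cycle.
def load_use_stalls(gap, forwarding):
    """gap: instructions between the load and the first use of its result."""
    needed = 1 if forwarding else 2   # stall cycles for a back-to-back use
    return max(0, needed - gap)

print(load_use_stalls(gap=0, forwarding=True))   # 1: the one-cycle stall above
print(load_use_stalls(gap=0, forwarding=False))  # 2: the two-cycle stall
print(load_use_stalls(gap=1, forwarding=True))   # 0: no stall needed
```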
The store instruction exhibits the same data hazards as arithmetic instructions such as sub. Consider the following program:
add r1, r2, r3 // r1 := r2 + r3
s 10(r0), r1 // memory[10] := r1
To solve the hazard, we have the usual three options:
- Copy the result from B (effectively, from the register file) to SMR and ignore any dependencies. In effect, leave the hardware as it was.
- Stall the pipeline. This results in a two-cycle penalty.
- Use forwarding. This requires an additional multiplexer (MUX8) and connections from OUT0 and OUT1 (connections marked 3 and 4).
Note that, conceptually, this is not a new problem (it is the exact same data hazard as we saw before). It is treated separately here, however, because we need a bit of extra hardware to resolve the hazard for stores.
Consider the following code sequence:
add r1, r2, r3
j l1 // Jump to label l1
sub r2, r3, r0
l1: xor r3, r0, r1
ori r1, r2, 10
How is this executed in the MIPS pipeline?
Cycle  IF    ID    EX    MA    WB
  1    add
  2    j     add
  3    sub   j     add               The branch is resolved here, but we have already fetched the sub
  4    xor   sub   j     add
  5    ori   xor   sub   j     add
Notice that we executed the sub instruction which, logically, should never have been executed because of the branch (j). This happens because the branch target is only calculated in the ID phase. This problem is called a control hazard.
As with load hazards, we can decide to ignore the control hazard. Once we define that the instruction directly after the branch is executed anyway, we can take that into account and make use of it. This technique is called delayed branches, and the instruction following the branch is said to be in the delay slot.
Try it out. Run this program using delayed branches.
Alternatively, we can stall the pipeline. Remember, when we stall the pipeline, the instruction in the ID phase (which would be the branch) is not issued (a nop is issued instead). We also said that the IF and ID phases simply re-execute, i.e., their registers are not updated.
As it turns out, this is not completely true. If the PC were not updated, the IF phase would fetch the same instruction again (the sub), and we would be in the exact same situation. Thus, for a control hazard, the PC is updated. The ID register and PC1 are not updated, however, because if they were, the sub instruction would make it to the ID phase, which is exactly what we want to avoid.
Here is how the program would be executed using pipeline interlock.
Cycle  IF    ID    EX    MA    WB
  1    add
  2    j     add
  3    sub   j     add               We detect a stall
  4    xor   j     nop   add         The PC is updated, so when the IF re-executes, a different instruction is loaded
  5    ori   xor   j     nop   add
Try it out. Run this program using branch interlock.
Finally, there is a third option: branch prediction. Branch prediction uses a Branch Target Buffer to predict what the next value of the PC will be. This Branch Target Buffer or BTB is an associative cache. Each entry in this cache lists the address of a branch instruction (the tag), together with the predicted target. This predicted target is based on what the branch did the last time it was executed.
When the IF phase executes, it sends the contents of the PC to the BTB (1). The BTB then compares this with the addresses (tags) in its cache. If the address is in the cache, we must have executed the branch before. We then predict that the branch will do the same as the last time, so we send the associated value in the BTB to the PC (2).
If the address is not there, either because we have never seen the branch before (compulsory miss), or because we did see the branch before, but too long ago and it has since been replaced by another branch in the BTB (capacity miss or conflict miss), we cannot predict the branch and we simply calculate PC+4 (as if we did not have branch prediction). This is also called predict-not-taken, because we "predict" that we do not take the branch and simply want to execute the next instruction in memory.
Of course, the prediction could be wrong. We cannot determine this until the ID phase, however. Remember, in the IF phase we do not know what kind of instruction we are executing - this is only determined in the ID phase. Therefore, in the ID phase, we calculate the branch target anyway. We then compare the calculated branch target with the predicted PC (by comparing it to the PC register). If the prediction is correct, all is well, but if it is wrong (or if we did not predict the target address because of a BTB miss), we need to stall the pipeline and update the BTB (connections 3, 4 and 5).
So there are 3 different situations:
- The branch is not in the BTB
We update the PC with PC+4. When the branch gets to the ID phase, we stall the pipeline to update the BTB.
- The branch is in the BTB, but the prediction is wrong
We update the PC with the predicted PC. When the branch gets to the ID phase, we stall the pipeline to update the BTB.
- The branch is in the BTB and the prediction is correct
We update the PC with the predicted PC. When the branch gets to the ID phase, no action is taken.
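The three situations above can be sketched with a toy Branch Target Buffer. This is a hypothetical model: the direct-mapped organisation, the 16-entry size and the method names are all illustrative assumptions, not details of the MIPS animation.

```python
# A toy Branch Target Buffer: a direct-mapped cache mapping a branch
# address (the tag) to the target it jumped to last time.
class BTB:
    def __init__(self, n_entries=16):
        self.n = n_entries
        self.entries = {}                 # index -> (tag, predicted target)

    def predict(self, pc):
        """IF phase: return the predicted next PC."""
        tag, target = self.entries.get(pc % self.n, (None, None))
        if tag == pc:
            return target                 # hit: do what the branch did last time
        return pc + 4                     # miss: predict-not-taken

    def update(self, pc, actual_target):
        """ID phase, on a miss or misprediction: remember the actual target."""
        self.entries[pc % self.n] = (pc, actual_target)

btb = BTB()
print(btb.predict(8))   # 12: a miss, so predict-not-taken (PC+4)
btb.update(8, 40)       # the branch at address 8 actually went to 40
print(btb.predict(8))   # 40: a hit, predict the same target as last time
```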
Try it out. See a demonstration of branch prediction. Notice that all three situations listed above occur when the program is executed.
The MIPS processor supports two conditional branches: beqz (branch if equal to zero) and bnez (branch if not equal to zero). Both these instructions take an arbitrary register as a source operand. This yields another data hazard:
add r1, r2, r3
bnez l1, r1
sub r2, r3, r1
l1: xor r3, r2, r1
ori r3, r1, 10
Consider the execution of this program in the pipeline:
Cycle  IF    ID    EX    MA    WB
  1    add
  2    bnez  add                     We could use a BTB here, or predict-not-taken
  3    sub   bnez  add               Now we need to check if our prediction was correct
The problems start in cycle 3. We need to check register r1 to see if it is not equal to zero. But the add instruction is still in the EX phase: the value for r1 is still being calculated! As with the other hazards that we have seen, we could ignore this problem and call it a feature.
We could also stall the pipeline, but we would have to stall it for two cycles (we would have to wait until the add gets to the WB phase). Even then, we would only have the result towards the end of the cycle (the register file is updated halfway through the cycle, and we still have to do the comparison against zero), but that is acceptable. That is quite a long stall, however, and this situation is very common (regard the add as a compare instruction and you'll understand why).
Finally, we could try to forward the result, as we did for data hazards. That means we would have to forward the contents of OUT0 (connection c) and OUT1 (d) to the comparator. But that is not enough! We also need to forward the result of the ALU, as it is being calculated, to the comparator (b). Remember, this result will only be written to OUT0 on the next clock cycle - which is too late. This adds an asynchronous (pronounce: dodgy) element to the circuit, but it is necessary to avoid a stall in this very common situation.
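The operand selection in front of the zero-comparator can be sketched as a priority chain: the combinational ALU result (b) wins over OUT0 (c), which wins over OUT1 (d), with the register file as the fallback. A hypothetical sketch with illustrative names:

```python
# Hypothetical sketch of the zero-comparator's operand selection. When a
# beqz immediately follows the instruction producing its register, the ALU
# result must be forwarded combinationally (b), since OUT0 is only written
# on the next clock edge - which would be too late.
def comparator_operand(src, regfile, ex_dest, alu_result,
                       ma_dest, out0, wb_dest, out1):
    if src == ex_dest:
        return alu_result    # (b): forwarded while still being computed
    if src == ma_dest:
        return out0          # (c)
    if src == wb_dest:
        return out1          # (d)
    return regfile[src]      # no hazard: use the register file

# add r1, r2, r3 ; bnez l1, r1 - the add is in EX when the bnez checks r1:
print(comparator_operand("r1", {"r1": 0}, "r1", 3, None, None, None, None))  # 3
```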
Summary of Pipeline Stalls
The following table lists all possible stalls exhaustively, with the conditions for the stall. For each individual stall, you can run an animation that will show a program in which the stall occurs. Note that only for the control hazards, the PC register is updated during the stall. For all other stalls, the PC register is not updated (see section on control hazards for more information).
The sample programs for each of these hazards allow you to change the circuit configuration and the contents of the instruction memory, if you wish. This allows you to try to remove the hazard by changing the program or the circuit configuration.
Hazard        IF   ID              EX              MA              WB   Condition                     Demo
RAW           i1   i2 .., Rx, ..   i3 Rx, .., ..   i4              i5   No ALU forwarding             Run
              i1   i2 .., Rx, ..   i3              i4 Rx, .., ..   i5   No ALU forwarding             Run
RAW (store)   i1   sb .., Rx       i3 Rx, .., ..   i4              i5   No forwarding to SMR          Run
              i1   sb .., Rx       i3              i4 Rx, .., ..   i5   No forwarding to SMR          Run
RAW (branch)  i1   beqz Rx         i3 Rx, .., ..   i4              i5   No zero-forwarding            Run
              i1   beqz Rx         i3              i4 Rx, .., ..   i5   No zero-forwarding            Run
              i1   jr Rx           i3 Rx, .., ..   i4              i5   (Always)                      Run
              i1   jr Rx           i3              i4 Rx, .., ..   i5   (Always)                      Run
Load          i1   i2 .., Rx, ..   lb Rx, ..       i4              i5   Load interlock                Run
Control       i1   j L1            i3              i4              i5   Calculated PC ≠ predicted PC  Run
              i1   jr Rx           i3              i4              i5   Calculated PC ≠ predicted PC  Run
              i1   beqz Rx, L1     i3              i4              i5   Calculated PC ≠ predicted PC  Run
Question 1. Why is the MA phase idle for all instructions except loads and stores?
Question 2. In the section on the introduction of pipelining, the same program is executed in the non-pipelined and in the pipelined processor. Confirm that the speedup is 300%, and explain why this is less than the expected 500%.
Question 3. Draw a pipeline diagram to show the hazard mentioned in the section on forwarding to the SMDR. Then write a program to show this hazard, and observe the effects of the different circuit configurations (Store operand forwarding, Store interlock, and No store interlock).
Question 4. The table with all pipeline stalls lists two situations that always cause a stall. What extra hardware would you need to avoid these stalls?
Question 5. The diagram shows four control lines for multiplexer MUX3 (see the section on conditional branches). Are all of these four lines strictly necessary?
Question 6. There are quite a number of stalls during the execution of the shift-and-add multiplier. Rewrite the program to avoid nearly all of these stalls.
last updated: 19 May 2005