Introduction

This tutorial makes use of the Vivio animation of a DLX/MIPS processor.

The processor we will be considering in this tutorial is the MIPS processor. The MIPS processor, designed in 1984 by researchers at Stanford University, is a RISC (Reduced Instruction Set Computer) processor. Compared with their CISC (Complex Instruction Set Computer) counterparts (such as the Intel Pentium processors), RISC processors typically support fewer and much simpler instructions.

The premise is, however, that a RISC processor can be made much faster than a CISC processor because of its simpler design. These days, it is generally accepted that RISC processors are more efficient than CISC processors; and even the only popular CISC processor that is still around (Intel Pentium) internally translates the CISC instructions into RISC instructions before they are executed [1].

RISC processors typically have a load-store architecture. This means there are two instructions for accessing memory: a load (l) instruction to load data from memory and a store (s) instruction to write data to memory. It also means that none of the other instructions can access memory directly. So, an instruction like "add this byte from memory to register 1" from a CISC instruction set would need two instructions in a load-store architecture: "load this byte from memory into register 2" and "add register 2 to register 1".

CISC processors also offer many different addressing modes. Consider the following instruction from the Intel 80x86 instruction set (with simplified register names):

add r1, [r2+r3*4+60] // i86 (not MIPS) example

This instruction loads a value from memory and adds it to a register. The memory location is given in between the square brackets. As you can see, the Intel instruction set, as is typical for CISC architectures, allows for very complicated expressions for address calculations or "addressing modes". The MIPS processor, on the other hand, only allows for one, rather simple, addressing mode: to specify an address, you can specify a constant and a register. So, the above Intel instruction could be translated as:

slli r3, r3, 2  // r3 := r3 << 2 (i.e. r3 := r3 * 4)
add  r2, r2, r3 // r2 := r2 + r3
l    r4, 60(r2) // r4 := memory[60+r4]
add  r1, r1, r4 // r1 := r1 + r4

We need four instructions instead of one, and an extra register (r4) to do what can be done with one instruction in a CISC architecture. The internal circuitry of the RISC processor is much simpler, however, and can thus be made very fast. How this is done, is the topic of this tutorial.

Basic Processor Architecture

The execution of an instruction in a processor can be split up into a number of stages. How many stages there are, and the purpose of each stage is different for each processor design. Examples includes 2 stages (Instruction Fetch / Instruction Execute) and 3 stages (Instruction Fetch, Instruction Decode, Instruction Execute). The MIPS processor has 5 stages:

IF The Instruction Fetch stage fetches the next instruction from memory using the address in the PC (Program Counter) register and stores this instruction in the IR (Instruction Register)
ID The Instruction Decode stage decodes the instruction in the IR, calculates the next PC, and reads any operands required from the register file.
EX The Execute stage "executes" the instruction. In fact, all ALU operations are done in this stage. (The ALU is the Arithmetic and Logic Unit and performs operations such as addition, subtraction, shifts left and right, etc.)
MA The Memory Access stage performs any memory access required by the current instruction, So, for loads, it would load an operand from memory. For stores, it would store an operand into memory. For all other instructions, it would do nothing.
WB For instructions that have a result (a destination register), the Write Back writes this result back to the register file. Note that this includes nearly all instructions, except nops (a nop, no-op or no-operation instruction simply does nothing) and s (stores).

Consider the execution of the following program and try to consider the operation of each of the 5 processor stages (IF, ID, EX, MA and WB) for each individual instruction. (Note that register r0 in the MIPS processor is always 0).

l   r1, 0(r0)  // r1 := memory[0]
l   r2, 1(r0)  // r2 := memory[1]
add r3, r1, r2 // r3 := r1 + r2
s   2(r0), r3  // memory[2] := r3

Try it out. Follow the execution of this program in the MIPS animation and check if you were right. If you don't understand each and every bit of the diagram yet, don't despair - each stage will be explained separately and in detail in the next section.

Details of the Processor Stages

As I said before, the Instruction Fetch phase fetches the next instruction. First, it sends the contents of the PC register, which contains the address for the next instruction, to the instruction memory (1). The instruction memory will then respond by sending the correct instruction. This instruction is sent on to the next (instruction decode) phase (2).

The instruction decode phase will calculate the next PC and will send it back to the IF phase (4) so that the IF phase knows which instruction to fetch next. To be able to execute a jr instruction (which changes the PC to the contents of a register), we also need a connection from the register file to the PC (5).

One of these (4 or 5) is then selected (MUX1) to update the PC on the next rising clock edge. The control line for MUX1 (not shown on the diagram) would also come from the ID phase, and is based on instruction type.

The Instruction Decode phase then has two main tasks: calculate the next PC and fetch the operands for the current instruction. There are three possibilities for the next PC.

  1. For the most common case, for all "normal" (meaning instructions that are not branches or jumps), we must simply calculate PC+4 to get the address of the next instruction (3, ADD4).
  2. For jumps and branches (if the branch is taken), we might also have to add some immediate value (the branch offset) to the PC (ADDi). This branch offset is encoded in the instruction itself (6).
  3. For the jr and jalr instructions, we need to use the value of a register instead.

MUX3 selects between the first two posibilities (+4 or +immediate), and MUX1 (in the IF phase) selects between MUX3 and the third posibility (value from the register file). The control line for MUX3 comes from a zero-detector (not shown), which in turn gets its input from the register file (thin line marked "a"). The control line for MUX1 is not shown and is based on the instruction type.

Then, the second main task is to fetch the operands for the next instruction. These operands come from the register file for all but two instructions (7, 8). The two instructions that are the exception are jal and jalr. These two jump instructions save the address of the next instruction in a destination register, so instead of sending an operand from the register file, we need to send the contents of the PC+4 (10).

MUX5 selects between (7) and (10). Again, the control line for this MUX (not shown) would also come from the ID phase, and is be based on the instruction type.

The Execution phase "executes" the instruction. Any calculations necessary are done in this phase. These calculations are all done by the ALU (Arithmetic and Logic Unit). The internals of these unit is beyond the scope of this tutorial, but you should find an explanation of this unit in any elementary book on digital logic, see for example [2].

Conceptually this stage is very simple. The ALU needs two operands. These either come from the ID phase (8, 9) and thus in turn from the register file (or PC+4 for jump and link instructions), or one operand comes from the ID phase (8) and the other comes from the instruction register (6) to supply an immediate value.

MUX7 selects between (9) and (6). The control line for this MUX is not shown, but would come from the ID phase and is based on the instruction type (it will select (9) for instructions that operate on two registers, and (6) for instructions that work one one register and one immediate value).

(11) is used for the s (store instruction). Consider the following instruction:

s 10(r0), r2 // memory[10+r0] := r2

The calculation 10+r0 will be done by the ALU (so which source does MUX7 select?), and the result of this calculation will be passed on to the MA stage (12). In the MA phase, however, we will also need to know the contents of r2. This is why we need (11).

The purpose of the Memory access phase is to store operands into memory or load operands from memory. So, for all instructions except loads and stores, the MA simply passes on the value from the ALU on to the WB stage (12, 16).

For loads and stores, the MA needs to send the effective address calculated by the ALU to memory (13). For loads, the memory will then respond by sending the requested data back (15). For stores, we need to send the data to be stored along with the effective address (11, 14), and the memory will respond by updating itself.

Finally, MUX9 will select between (12) and (15) for the input for (16). As per usual, this decision would be based on the instruction type and the control line would come from the ID phase (not shown).

The Write Back phase, finally, is a very simple one. It simply takes the output of the MA phase (..), and sends it back to the write back phase to store the result into the destination register.

For stores (and no-ops), the write-back does nothing.

Try it out. Watch the execution of example 1, example 2 and example 3 and see if you can understand the operation of each processor stage.


Pipelining the Processor

If you have watched the execution of your programs carefully, you will have noticed that only one stage is active at any given moment. First, the IF stage fetches the instruction. Then, the ID phase decodes the instruction and fetches the operands, while the IF phase is idle. Then the EX phase executes the instruction while both IF and ID are idle. And so forth. Also, the clock period is really long: it must be long enough to allow an instruction to propagate through all 5 stages.

An obvious method to speed up the processor is to make the individual pipeline stages work in parallel. So, when the ID phase is decoding an instruction i, the IF phase would be fetching the next instruction i + 1 in parallel. On the next clock cycle, the EX phase would execute instruction i, the ID phase would decode instruction i + 1 and the IF phase would fetch instruction i + 2. This technique is called pipelining.

For this to work, each phase needs it's own set of registers to hold the instruction and instruction operands. This is necessary because each stage will be working on a different instruction. But inserting registers in between the different phases has a huge advantage: the clock period can now be much shorter. Indeed, we only need to allow for an instruction to make it through one stage of the processor, instead of all 5. So, at least in theory, we can clock the processor 5 times faster. Shown visually:

No pipelining

With pipelining

Cycle IF ID EX MA WB
1 i
i
i
i
i
2 i+1
i+1
i+1
i+1
i+1
Cycle IF ID EX MA WB
1 i
2 i+1 i
3 i+2 i+1 i
4 i+3 i+2 i+1 i
5 i+4 i+3 i+2 i+1 i
6 i+5 i+4 i+3 i+2 i+1
7 i+6 i+5 i+4 i+3 i+2
8 i+7 i+6 i+5 i+4 i+3
9 i+8 i+7 i+6 i+5 i+4
10 i+9 i+8 i+7 i+6 i+5

There are two observations to be made here. First of all, instruction i takes 1 (long) clock cycle in the non-pipelined processor, and it takes 5 (short) clock cycles in the pipelined processor. Note that this "short" cycle is 1/5 of the "long" cycle, so it actually takes just as long to execute an instruction in the non-pipelined version as it does in the pipelined version.

The point is, however, that even though it takes just as long to execute one instruction, both processors finish the execution of an instruction after every clock cycle. Instruction i+1 finishes one clock cycle after instruction i for both processors. BUT: the clock cycle for the pipelined processor is a fifth of the clock cycle for the non-pipelined processor. Hence, the throughput, the number of instructions executed per unit time, is 5 times higher for the pipelined processor than it is for the non-pipelined processor. In other words, the pipelined processor is 5 times faster than the non-pipelined processor.

Try it out. To illustrate this point, watch the execution of a single instruction in the non-pipelined and in the pipelined processor. Notice that in the pipelined processor, the clock period is much shorter, and every stage now has its own set of registers. Also note that despite this, the instruction takes the same time to execute in either processor.

Try it out. Then consider the following simple program and watch the execution in both the non-pipelined and the pipelined processor and compare execution speed. You should notice that it takes 6 long cycles to execute the program in the non-pipelined processor, and 10 short cycles to execute the program in the pipelined processor.

addi r1, r0, 4  // r1 := 4
addi r2, r0, 5  // r2 := 5
addi r3, r0, 6  // r3 := 6
slli r1, r1, 2  // r1 := r1 << 2 (r1 := r1 * 4)
subi r2, r2, 3  // r2 := r2 - 3
add  r3, r3, r3 // r3 := r3 + r3

Data Hazards

Of course, it is not as simple as it seems (is anything ever?). The above program was carefully chosen for the pipelining to work. Consider the following code fragment:

add r1, r2, r3 // r1 := r2 + r3
sub r3, r1, r2
xor r3, r1, r2
srl r3, r1, r2
ori r3, r1, 10

If we look at the pipeline during the executing of these instructions, we see:

Cycle IF ID EX MA WB
1 add
2 sub add
3 xor sub add
4 srl xor sub add
5 ori srl xor sub add
6 ori srl xor sub

In cycle 3, the sub instruction reads the value of register r0. The add instruction, however, whose destination register is r0, has not reached the write-back phase yet. Hence, the sub will read the wrong value for r0!

This situation is called a data hazard. There is another data hazard cycle 4 (which instructions?). The srl in cycle 5 is OK, because the add will write the result back to the register file in the first half of the clock cycle, and the srl will read its operands in the second half of the clock cycle (two phase memory access). It should be clear that the ori in cycle 6 is OK too - the add has finished execution completely in this cycle.

The easiest solution is to ignore the problem. While that might sound like a bad solution, it is actually quite usable. Consider the above program again, and assume that r1 is initially 1, and r2 is initially 2. Then:

add r0, r1, r2 // r0 := 3
sub r2, r0, r1 // r2 := -1 (uses old value of r0)
xor r2, r0, r1 // r2 := 1 (uses old value of r0)
srl r2, r0, r1 // r2 := 6 (uses new value of r0)
ori r2, r0, 10 // r2 := 11 (uses new value of r0)

It becomes a bit harder to read the source code, because the processor seems to execute instructions out of order. But as long as this behaviour is well-defined, the programmer (or the compiler) can take it into account and make good use of it. (Note that the processor does not really execute instructions out of order.)

Try it out. Follow the execution of this program in the MIPS pipeline.

Pipeline Interlocks

A second solution is to stall the pipeline. To explain what a stall is, we need a bit of extra terminology. We can divide the pipeline into two parts, instruction fetch and instruction execute:

Instruction Fetch Instruction Execute
IF ID EX MA WB

When an instruction makes it from the first half of this pipeline to the second half, i.e. from the ID phase to the EX phase, it is said to be issued. On a stall, we do not issue the instruction in the ID phase. Instead, we issue a NOP ("do nothing") instruction. In the meantime, the IF and and ID phase will simply re-execute (we shall later see that it is is not quite as simple as that). The above program would be executed as follows:

Cycle IF ID EX MA WB
1 add
2 sub add
3 xor sub add We detect a stall here, and issue a nop instead of the sub
4 xor sub nop add The add has not yet reached WB, so we need another stalled cycle
5 xor sub nop nop add The add will write its results in the first half of the cycle, so the xor will now get the correct value for r0
6 srl xor sub nop nop

The downside is that with this solution, we need 7 cycles instead of 5 to execute our program. That's a 40% increase in execution time!

Try it out. Run this program in the animation. For clarity, the MIPS animation will issue a "stall" instruction instead of a nop instruction upon a stall. This is only to distinguish nops issued because of stalls from nops that were actually in the instruction memory. In reality, the stall instruction does not exist.

ALU Forwarding

We want to be able to use the result of the ALU directly, without having to wait for the result to be written back to the register file. For this, we need to forward the result of the ALU directly back into the ALU.

Before the result gets written to the destination register, it is written into the intermediate registers OUT0 and OUT1 first (see diagram). We need to forward the contents of OUT0 and OUT1 back into the input of the ALU (connections marked 1a and 1b connect OUT0 to the ALU, and connections marked 2a and 2b connect OUT1 to the ALU). Of course, we then also need a multiplexer to select the right operand (MUX6 and MUX7).

With this solution, there are no more stalls, and the code is executed the way you would expect from reading the source code. If you don't understand it, try the animation - it should become clear.

Load Hazards

Consider the following code fragment.

l   r1, 1(r2) // Load the value at address 1+r2 into r1
add r3, r1, r2

How is this executed in the pipeline?

Cycle IF ID EX MA WB
1 lb
2 add lb
3 add lb
4 add lb
5 add lb

Again, the lb will only write the value from memory into the register file in the WB phase, whereas the add retrieves its operands in the ID phase. Thus, the add will actually use an old value of r1. As before, we could ignore this problem and call it a feature, and that would be fine. We could also stall the pipeline for two cycles:

Cycle IF ID EX MA WB
1 lb
2 add lb
3 add lb We detect a stall, issue a nopinstead of the add
4 add nop lb And another one
5 add nop nop lb

The lb will write its result into OUT1 before it is written to the register file, however, so if we use pipeline forwarding, and we forward the result of OUT1 directly back to the ALU, we can reduce this to a one-cycle stall.

Cycle IF ID EX MA WB
1 lb
2 add lb
3 add lb We detect a stall, issue a nop instead of the add
4 add nop lb
5 add nop lb

We cannot do better than this, simply because the value is not loaded from memory until the MA phase. Note that the load calculates the effective address of the memory operand in the ALU phase; in the above example, it would calculate 2+r2.

Try it out. Consider each solution (ignoring the problem, interlock and interlock with forwarding) in the MIPS pipeline.

Stores

The store instruction exhibits the same data hazards as, for example, arithmetic instructions such as add or sub. Consider the following program:

add r1, r2, r3
s   10(r0), r1 memory[10] :=r1

To solve the hazard, we have the usual three options:

  1. Copy the result from B (effectively, from the register file) to SMR and ignore any dependencies. In effect, leave the hardware as it was.
  2. Stall the pipeline. This results in a two-cycle penalty
  3. Use forwarding. This requires an additional multiplexer (MUX8) and connections from OUT0 and OUT1 (connections marked 3 and 4)

Note that, conceptually, this is not a new problem (it is the exact same data hazard as we saw before). It is treated separately here, however, because we do need a bit of extra hardware to resolve the hardware for stores.

Control Hazards

Consider the following code sequence:

    add r1, r2, r3
    j   l1         // Jump to (label) l1
    sub r2, r3, r0

l1: xor r3, r0, r1
    ori r1, r2, 10

How this is executed in the MIPS pipeline?

Cycle IF ID EX MA WB
1 add
2 j add
3 sub j add The branch is resolved here, but we already fetched the sub
4 xor sub j add
5 ori xor sub j add

Notice that we executed the sub instruction which, logically, should never have executed because of the branch (j). This happens because the branch target is only calculated in the ID phase. This problem is called a control hazard.

As with load hazards, we can decide to ignore the control hazard. Once we define that the instruction directly after the branch is executed anyway, we can take that into account and make use of it. This technique is called delayed branches, and the instruction following the branch is said to be in the delay slot.

Try it out. Run this program using delayed branches.

Alternatively, we can stall the pipeline. Remember, when we stall the pipeline, the instruction in the ID phase (which would be the branch) is not issued (a nop is issued instead). We also said that the IF and ID phase simply re-execute, i.e., their registers are not updated.

As it turns out, this is not completely true. If the PC would not be updated, the IF phase would fetch the same instruction again (the sub), and we would be in the exact same situation. Thus, the PC is updated for a control hazard. The ID register and PC1 are not updated, however, because if they did, the sub instruction would make it to the ID phase, which is exactly what we want to avoid).

Here is how the program would executed using pipeline interlock.

Cycle IF ID EX MA WB
1 add
2 j add
3 sub j add We detect a stall
4 xor j nop add PC is updated so when the IF re-executes, a different instruction is loaded
5 ori xor j nop add

Try it out. Run this program using branch interlock.

Branch Prediction

Finally, there is a third option, branch prediction. Branch prediction uses a Branch Target Buffer to predict what the target of the PC will be. This branch target buffer or BTB is an associative cache. Each entry in this cache lists an address of a branch instruction (the tag), together with the prediction target. This predicted target is based on the last time the branch was executed.

When the IF phase executes, it sends the contents of the PC to the BTB (1). The BTB then compares this with the addresses (tags) in its cache. If the address is in the cache, we must have executed the branch before. We then predict that the branch will do the same as the last time, so we send the associated value in the BTB to the PC (2).

If the address is not there, either because we never saw the branch before (compulsory miss), or because we did see the the branch before, but too long ago and it has been replaced by another branch in the BTB (capacity miss or conflict miss), we cannot predict the branch and we simply calculate PC+4 (as if we did not have branch prediction, this is also called predict-not-taken, because we "predict" that we do not take the branch and simply want to execute the next instruction in memory).

Of course, the prediction could be wrong. We cannot determine this until the ID phase, however. Remember, in the IF phase, we do not know what kind of instruction we are executing - this is only determined in the ID phase. Therefore, in the ID phase, we calculate the branch target anyway. We then compare the calculated branch target with the predicted PC (by comparing it to the PC register). If the prediction is correct, all is well, but if it is wrong (or if we did not predict the target address because of a BTB miss), we need to stall the pipeline and update the BTB (connections 3, 4 and 5).

So there are 3 different situations:

  1. The branch is not in the BTB
    We update the PC with PC+4. When the branch gets to the ID phase, we stall the pipeline to update the BTB.
  2. The branch is in the BTB, but the prediction is wrong
    We update the PC with the predicted PC. When the branch gets to the ID phase, we stall the pipeline to update the BTB.
  3. The branch is in the BTB and the prediction is correct
    We update the PC with the predicted PC. When the branch gets to the ID phase, no action is taken.

Try it out. See a demonstration of branch prediction. Notice that all three situations listed above occur when the program is executed.

Conditional Branches

The MIPS processor supports two conditional branches: beqz (branch if equal to zero) and bnez (branch if not equal to zero). Both these instructions take an arbitrary register as a source operand. This yields another data hazard:

    add  r1, r2, r3
    bneq l1, r1
    sub  r2, r3, r1

l1: xor r3, r2, r1
    ori r3, r1, 10

Consider the execution of this program in the pipeline:

Cycle IF ID EX MA WB
1 add
2 bneq add We could use a BTB here, or predict-not-taken
3 sub bneq add Now we need to check if our prediction was correct

The problems start in cycle 3. We need to check register r1 to see if it is not equal to zero. But the add instruction is still in the EX phase: the value for r1 is still being calculated! As for the other hazards that we have seen, we can ignore this problem, and call it a feature.

We call also stall the pipeline, but we would have to stall it for two cycles (we would have to wait until the add gets to the WB phase). Even then, we would only have the result towards the end of the cycle (the register file is updated halfway through the cycle, and we still have to do the comparison against zero), but that is acceptable. That is quite a long stall, however, and this situation is very common (regard the add as a compare instruction and you'll understand why).

Finally, we could try to forward the result, as we did for data hazards. That means, we would have to forward the contents of OUT0 (connection c) and OUT1 (d) to the comparator. But that is not enough! We also need to forward the result of the ALU as it is being calculated to the comparator (b). Remember, this result will only be written to OUT0 on the next clock cycle - which is too late. This adds an asynchronous (pronounce: dodgy) element to the element, but it is necessary to avoid a stall in this - very common - situation.

Try it out. See all three possible solutions (no zero forwarding, zero interlock and zero forwarding) in the animation.

Summary of Pipeline Stalls

The following table lists all possible stalls exhaustively, with the conditions for the stall. For each individual stall, you can run an animation that will show a program in which the stall occurs. Note that only for the control hazards, the PC register is updated during the stall. For all other stalls, the PC register is not updated (see section on control hazards for more information).

The sample programs for each of these hazards allow to change the circuit configuration and the contents of the instruction memory, if you wish. This allows you to try and remove the hazard by changing the program or the circuit configuration.

Hazard IF ID EX MA WB Condition Demo
RAW i1 i2 .., Rx, .. i3 Rx, .., .. i4 i5 No ALU forwarding Run
i1 i2 .., Rx, .. i3 i4 Rx, .., .. i5 No ALU forwarding Run
RAW i1 sb .., Rx i3 Rx, .., .. i4 i5 No forwarding to SMR Run
(store) i1 sb .., Rx i3 i4 Rx, .., .. i5 No forwarding to SMR Run
RAW i1 beqz Rx i3 Rx, .., .. i4 i5 No zero-forwarding Run
(branch) i1 beqz Rx i3 i4 Rx, .., .. i5 No zero-forwarding Run
i1 jr Rx i3 Rx, .., .. i4 i5 (Always) Run
i1 jr Rx i3 i4 Rx, .., .. i5 (Always) Run
Load i1 i2 .., Rx, .. lb Rx, .. i4 i5 Load interlock Run
Control i1 j L1 i3 i4 i5 Calculated PC ≠ pc Run
i1 jr Rx i3 i4 i5 Calculated PC ≠ pc Run
i1 beqz Rx, L1 i3 i4 i5 Calculated PC ≠ pc Run

Questions

Question 1. Why is the MA phase idle for all instructions except loads and stores?

Question 2. In the section introduction of pipelining, the same program is executed in the non-piplined and in the pipelined processor. Confirm that the speedup is 300%, and explain why this is less than the expected 500%.

Question 3. Draw a pipeline diagram to show the hazard mentioned in the section on forwarding to the SMDR. Then write a program to show this hazard, and observe the effects of the different circuit configurations (Store operand forwarding, Store interlock, and No store interlock).

Question 4. The table with all pipeline stalls lists two situations that always cause a stall. What extra hardware would you need to avoid these stalls?

Question 5. The diagram shows four control lines for multiplexer MUX3 (see the section on conditional branches). Are all of these four lines strictly necessary?

Question 6. There are quite a number of stalls during the execution shift-and-add multiplier. Rewrite the program to avoid nearly all of these stalls.

Bibliography

[1] Computer Architecture: A Quantitative Approach, 3rd edition, John L. Hennessy & David A. Patterson, Morgan Kaufmann Publishers [now Elsevier]
[2] Logic and Computer Design Fundamentals, 2nd edition, M. Morris Mano and Charles R. Kime, Prentice Hall