Engineers have constantly
searched for ways to improve processor performance. One way to remove
performance bottlenecks is to make the processor do more in one clock
cycle. This has been accomplished through pipelined processors,
superscalar processors, hyper-pipelined processors, dynamic execution,
and the Explicitly Parallel Instruction Computing (EPIC) model.
1. Pipelined Processors
In April 1989, Intel introduced the i486 processor, its first pipelined x86 processor.
A pipelined processor
does not wait for one instruction to be completed before it begins
another. Instead, when the prefetch unit completes the first stage and
sends an instruction to the decode unit, it immediately starts working
on the next instruction. As soon as the decode unit finishes one
instruction, it immediately begins working on the next, and so on. This
is illustrated in Figure 1.
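The overlap described above can be sketched with a short, idealized timing model (the function name is ours, and the model ignores stalls and hazards):

```python
# Idealized pipeline timing sketch: each instruction advances one stage
# per clock cycle, and a new instruction enters the pipeline every cycle.

def cycles_to_complete(num_instructions, stages, pipelined=True):
    """Return the total clock cycles needed to finish all instructions."""
    if not pipelined:
        # Each instruction must finish every stage before the next starts.
        return num_instructions * stages
    # The first instruction fills the pipeline; after that, one
    # instruction completes every cycle.
    return stages + (num_instructions - 1)

print(cycles_to_complete(10, 5, pipelined=False))  # 50
print(cycles_to_complete(10, 5, pipelined=True))   # 14
```

With a 5-stage pipeline, ten instructions finish in 14 cycles instead of 50, which is the whole point of keeping every stage busy.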
2. Superscalar Processors
The Pentium
processor, introduced in 1993, added a second execution
unit, which gave the Pentium
two pipelines. When both pipelines were operating simultaneously, the
processor could execute two instructions during one clock cycle. This is
also known as having “two instructions in flight,” and is illustrated
in Figure 2.
A processor that can execute more than one instruction per clock cycle
is a superscalar processor. Pentium II and Pentium III processors are
superscalar processors, each having five execution units.
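Under the same idealized assumptions as before, a second pipeline roughly halves the cycles spent once the pipelines are full. This is only a sketch: real gains are limited by hazards and by how many independent instructions are available.

```python
import math

# Idealized superscalar throughput sketch: with `width` parallel
# pipelines, up to `width` instructions complete per cycle once the
# pipelines are full. Hazards and dependencies are ignored.

def cycles_superscalar(num_instructions, stages, width):
    """Cycles to finish all instructions with `width` pipelines."""
    return stages + math.ceil(num_instructions / width) - 1

print(cycles_superscalar(10, 5, width=1))  # 14  (single pipeline)
print(cycles_superscalar(10, 5, width=2))  # 9   (two pipelines)
```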
3. Hyper-Pipelined Processors
Adding execution
units is one method of increasing the number of instructions that a
processor deals with in one clock cycle. Expanding the number of steps,
or stages, in the pipeline is another method. The Pentium processor has a
5-stage pipeline, the Pentium Pro processor has a 12-stage pipeline,
and the Pentium 4 processor, introduced in 2000, has a 20-stage
pipeline, as shown in Figure 3.
Increasing the number of
stages in a pipeline might at first seem illogical. An instruction that
previously took only 5 cycles to complete now might take as many as 20
cycles. One benefit, however, is that by breaking the work into smaller
steps, the processor has to do less work in each step. The system clock
speed can be increased significantly, and more instructions can be
completed in the same amount of time.
Example
Suppose each instruction requires 10 cycles of work, split across a
10-stage pipeline. The pipelined processor can complete 11 instructions
in 20 cycles: the first instruction finishes when the pipeline fills at
cycle 10, and one more finishes in every cycle after that.
Without the pipeline, the processor could complete only two instructions
in 20 cycles, making the hyper-pipelined processor 5.5 times faster.
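The example's figures can be checked with a small calculation. Note the 10-stage pipeline here is an assumption chosen to make the arithmetic work out; the Pentium 4's 20-stage pipeline would give different numbers.

```python
# How many instructions complete in a fixed cycle budget?
# Idealized model: no stalls, one instruction enters per cycle.

def instructions_in(cycles, stages, pipelined=True):
    if not pipelined:
        # Each instruction monopolizes the processor for `stages` cycles.
        return cycles // stages
    # First result appears at cycle `stages`, one more per cycle after.
    return max(0, cycles - stages + 1)

stages = 10  # assumption: the example's figures match a 10-stage pipeline
pipe = instructions_in(20, stages)                    # 11
nopipe = instructions_in(20, stages, pipelined=False)  # 2
print(pipe / nopipe)  # 5.5
```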
4. Dynamic Execution
In 1995, Intel introduced the
Pentium Pro, the first processor based on the P6 architecture. Part of
the P6 architecture is a set of technologies known collectively as dynamic execution.
Dynamic execution consists
of out-of-order execution, branch prediction, and speculative execution.
These technologies are designed to help overcome pipeline
hazards, situations in which pipelining itself creates performance
problems. Hazards affect any pipelined processor, but their cost grows
with pipeline depth.
A pipeline hazard blocks the pipeline until the hazard is removed. This is called a pipeline stall.
The deeper the pipeline, the more costly (in performance terms) a
pipeline stall can be. A 20-stage pipeline might have to wait as many as
19 cycles to process the correct branch. A 10-stage pipeline would have
to wait only 9 cycles in the same type of situation. This concept is
illustrated in Figure 4.
Two types of common pipeline hazards are dependence problems and branching problems.
Dependence problems are handled by a processor function known as out-of-order execution. Branching problems are handled by a processor function known as branch prediction. A final processor function that helps resolve pipeline hazards is known as speculative execution. Each of these functions is explained in the following subsections.
4.1. OUT-OF-ORDER EXECUTION
Dependence
problems occur when one instruction needs data from another instruction
to complete its operation. If the first instruction is not complete
before the second instruction needs the data, the pipeline must wait
until the first instruction is complete.
To handle
dependence problems, processors perform out-of-order execution. Instead
of holding up the pipeline when it is waiting for the first instruction
to finish, the processor can process other instructions that do not
depend on the first instruction. The results of these instructions are
stored until they are needed later. When the first instruction has
finished processing, the dependent instruction is introduced into the
pipeline. The result is that fewer cycles are wasted.
Example
Here is an example of out-of-order execution. Suppose a processor needs to execute the following three instructions:
Instruction 1:
8 + X = Y
Instruction 2:
Y + 2 = Z
Instruction 3:
A + 2 = B
Notice
that Instruction 2 depends on Instruction 1. A processor that can
perform out-of-order execution could introduce Instruction 3 (which has
no dependencies on previous instructions) into the pipeline before
Instruction 2. When Instruction 1 is complete (and Instruction 3 is
already far down the pipeline), Instruction 2 is introduced into the
pipeline, as illustrated in Figure 5.
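The scheduling decision in this example can be sketched as a toy issue model. The names and latencies are hypothetical, and real processors track dependencies in hardware with far more machinery, but the reordering principle is the same:

```python
# Toy cycle-by-cycle issue model: one instruction issues per cycle, but
# an instruction may not issue until every instruction it depends on has
# completed. Names and latencies are invented for illustration.

def schedule(instructions):
    """instructions: list of (name, deps, latency). Returns issue order."""
    finish = {}   # name -> cycle at which its result is ready
    order = []
    cycle = 0
    pending = list(instructions)
    while pending:
        for inst in pending:
            name, deps, latency = inst
            # Issue the first instruction whose inputs are all ready.
            if all(finish.get(d, float("inf")) <= cycle for d in deps):
                order.append(name)
                finish[name] = cycle + latency
                pending.remove(inst)
                break
        cycle += 1
    return order

# I2 depends on I1 (Y = 8 + X feeds Z = Y + 2); I3 is independent.
program = [("I1", [], 3), ("I2", ["I1"], 1), ("I3", [], 1)]
print(schedule(program))  # ['I1', 'I3', 'I2']
```

Because I1 takes several cycles, the independent I3 issues ahead of the dependent I2, exactly as in Figure 5.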
4.2. BRANCH PREDICTION
Branching instructions
check a condition, and then execute one branch of instructions if the
condition is true, or another branch of instructions if the condition is
false.
The branching
instruction might take up to 20 cycles to complete, but the prefetch
unit keeps filling the pipeline. It could send instructions from the
wrong branch into the pipeline before the branching instruction is
complete.
When this happens,
all the instructions in the pipeline are stopped until the instructions
from the wrong branch are flushed from the pipeline, as illustrated in Figure 6.
Branch
prediction helps mitigate branching problems. During branch prediction,
the first time a branch instruction is executed, its address and that of
the next instruction executed (the correct branch) are stored in the branch target buffer
(BTB). The processor uses this information to predict which way the
instruction will branch the next time the processor executes it. Branch
prediction is correct more than 90% of the time.
When the first
instruction is used again, the processor prefetches the branch that was
correct the last time, so the wait is minimized and incorrect guesses
need to be flushed less frequently. When branch prediction is correct,
executing a branch does not cause a pipeline stall.
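A branch target buffer can be sketched as a simple lookup table that remembers where each branch went last time. This is a minimal model; real predictors add history bits, saturating counters, and set-associative tables.

```python
# Minimal branch target buffer (BTB) sketch: remember, per branch
# address, the address that followed it last time, and predict that
# same target the next time the branch is fetched.

class BranchTargetBuffer:
    def __init__(self):
        self.table = {}  # branch address -> predicted next address

    def predict(self, branch_addr, fallthrough_addr):
        # First encounter: no history, so assume fall-through.
        return self.table.get(branch_addr, fallthrough_addr)

    def update(self, branch_addr, actual_next_addr):
        # Record where this branch actually went.
        self.table[branch_addr] = actual_next_addr

btb = BranchTargetBuffer()
print(hex(btb.predict(0x400, 0x404)))  # 0x404 (no history: fall-through)
btb.update(0x400, 0x480)               # branch actually went to 0x480
print(hex(btb.predict(0x400, 0x404)))  # 0x480 (predict last target)
```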
4.3. SPECULATIVE EXECUTION
Speculative execution
is the third component of dynamic execution. It is a blend of branch
prediction and out-of-order execution.
A processor that can
perform speculative execution will predict the path that a branching
instruction will take and begin executing nonconditional instructions in
that path. (This differs from branch prediction, which only
prefetches the instructions but does not execute them.)
Instead of
writing the instruction results to the general purpose registers as it
normally would, the processor stores these speculative results in
temporary registers. When the branching instruction is resolved, the
precalculated results can be taken from the temporary registers and
written to the general registers in the order they were issued.
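The commit step can be sketched as follows. The register names and values are hypothetical, and real processors implement this with reorder buffers and register renaming, but the idea of parking results until the branch resolves is the same:

```python
# Sketch of speculative execution: results are held in temporary
# storage and committed to the architectural registers, in issue order,
# only once the branch prediction is confirmed correct.

arch_regs = {"eax": 0, "ebx": 0}  # hypothetical architectural registers

def speculate_and_commit(speculative_writes, prediction_correct):
    """speculative_writes: list of (register, value) in issue order."""
    temp = {}
    for reg, value in speculative_writes:  # execute down the predicted path
        temp[reg] = value                  # results parked in temporaries
    if prediction_correct:
        for reg, value in speculative_writes:  # commit in original order
            arch_regs[reg] = value
    # On a misprediction the temporaries are simply discarded;
    # the architectural registers were never touched.

speculate_and_commit([("eax", 42), ("ebx", 7)], prediction_correct=True)
print(arch_regs)  # {'eax': 42, 'ebx': 7}
speculate_and_commit([("eax", 99)], prediction_correct=False)
print(arch_regs)  # unchanged: the wrong-path result was discarded
```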
5. EPIC
In May 2001, Intel introduced the Itanium processor, which is based on the Explicitly Parallel Instruction Computing (EPIC) model.
In the EPIC model, a compiler bundles instructions in groups of three. Each bundle includes a few bits, called a template,
that tell the execution units whether the instructions are dependent or
independent. Intel calls this explicitly parallel because the compiler
tells the execution units which instructions can be processed in
parallel. The Itanium processor does not have to determine the
processing order the way a processor with out-of-order capabilities
must.
The EPIC
architecture also handles branching in a new way. Each instruction has a
1-bit predicate. The predicates for the instructions in one branch are
set to true; the predicates for the instructions in the other branch are
set to false.
A
branching instruction is sent to the processor followed quickly by both
branches. The processor determines whether the branching instruction is
true or false. If it is true, only the instructions with a true
predicate are processed. If the statement is false, only instructions
with a false predicate are processed. There is no need to flush the
pipeline for mispredicted branches.
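Predicated execution can be sketched like this. The instruction encoding below is invented for illustration; real IA-64 predicates are registers named in each instruction, not literal bits.

```python
# Sketch of 1-bit predication: instructions from both branch paths are
# issued together, each tagged with a predicate, and only those whose
# predicate matches the resolved condition take effect.

def execute_predicated(instructions, condition):
    """instructions: list of (predicate_bit, register, value)."""
    regs = {}
    for predicate, reg, value in instructions:
        if predicate == condition:   # squash the other path's work
            regs[reg] = value
    return regs

both_paths = [
    (True,  "r1", 10),   # "then" path: takes effect when condition is true
    (False, "r1", 20),   # "else" path: takes effect when condition is false
    (False, "r2", 99),
]
print(execute_predicated(both_paths, True))   # {'r1': 10}
print(execute_predicated(both_paths, False))  # {'r1': 20, 'r2': 99}
```

Both paths flow through the pipeline, so no flush is ever needed; the cost is that the squashed path's slots do no useful work.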