A.R.Hurson Computer Science Department Missouri Science & Technology hurson@mst.edu

► Why did I ask some questions at CDC6600 and 7600?

Earlier notion of Super scalar processor discussed.

What is Instruction Level Parallelism?



►ILP can be exploited in two large separable ways:

- Dynamic approach where mainly hardwork locates the parallelism,
- Static approach that largely relies on softwork to locate parallelism.



Straight line code blocks are between f to seven instructions that are norm dependent on each other — degree parallelism within a code block is limited

Several studies have shown that aver parallelism within a basic block ran exceeds 3 or 4.



The presence of dependence indicates potential for a hazard, but actual hazard the length of any stalls is a property of pipeline.

In general, data dependence indicates:
 The possibility of a hazard,
 The order in which results must be calculated,
 An upper bound on how much parallelism ca possibly exploited.



There is also a chance that some of instructions in a basic building block data dependent on each other.



Therefore, to obtain substan performance gains we must exploit across multiple basic blocks.

Simplest and most common way increase parallelism is to exp parallelism among loop iterations loop level parallelism.

- Data Dependency: If an instruction uses a produced by a previous instruction, then the se instruction has a data dependency on the instruction.
- Data dependence limits the performance of a s pipelined processor. The limitation of data depend is even more severe in a super scalar than a s processor. In this case, even longer operat latencies degrade the effectiveness of super s processor drastically.

#### Data dependency





#### Control Dependence

- ➤As in traditional *RISC* architecture, condependence effects the performance of statistical processors.
- ➢ However, in case of super scalar organizate performance degradation is even more seven since, the control dependence prevents execution of a potentially greater number instructions.





#### Control Dependency

| I                                        | IF | ID | EX | WB |    |    |    |    |
|------------------------------------------|----|----|----|----|----|----|----|----|
|                                          | IF | ID | EX | WB |    |    |    |    |
| I <sub>1</sub> /branch<br>I <sub>2</sub> |    |    |    | IF | ID | EX | WB |    |
| I <sub>3</sub>                           |    |    |    | IF | ID | EX | WB |    |
| I,                                       |    |    |    |    | IF | ID | EX | WB |
| I <sub>5</sub>                           |    |    |    |    | IF | ID | EX | WB |





- A resource conflict arises when two instruct attempt to use the same resource at the s time. Resource conflict is of concern in a so pipelined processor.
- A super scalar processor has a much la number of potential resource conflicts.



#### Resource Dependency



Performance degradation due to the resource depended can be significantly improved by pipelining the funct units.



- A quick look at our previous examples control imply that the data dependencies and resond dependencies have the same effect on instruction pipelining.
- Resource dependence can be resolved moderated by duplicating the hardware pipelining the hardware. However, this is not for the data dependence case.

# Resource and Data Dependencies — Assurption pipelined functional units:





 Increasing parallelism within blocks
 Parallelism within a basic block is limited dependencies between instructions. Some these dependencies are real, some are false:





False depen



Increasing parallelism within blocks

- Smart compiler might pay attention to register allocation in order to overcome f dependencies.
- Hardware register renaming is and alternative to overcome false dependencies.



#### Register Renaming

- Hardware renames the original register identifier in the instruction to correspond the new register with current value.
- Hardware that performs register renaming creates new reinstance and destroys the instance when its value is supersede there are not outstanding references to the value.
- To implement register renaming, the processor typically alloc new register for every new value produced — the same re identifier in several different instructions may access dif hardware registers.



Register Renaming



Each assignment to a register creates a new instance creater.



Increasing parallelism Cross block boundar
 Branch prediction is often used to keep

- pipeline full.
- ➢ Fetch and decode instructions after a brawhile executing the branch and the instruct before it Must be able to execute instruct across an unknown branch speculatively.



#### Increasing parallelism Cross block boundaries

- Many architectures have several kinds of instructures that changes the flow of control:
  - Branches are conditional and have a destination some of from the program counter.
  - Jumps are unconditional and may be either direct or indire
    - A direct jump has a destination explicitly defined instruction,
    - An indirect jump has a destination which is the result of computation on registers.



Increasing parallelism Cross block boundar
 Loop unrolling is a compiler optimizate technique which allows us to reduce number of iterations — Removing a 1 portion of branches and creating larger block that could hold parallelism unavailable becars of the branches.



# Assume the following program: LOOP:

| LD  |
|-----|
| ADD |
| SD  |
| SUB |
| BNZ |

| F <sub>0</sub> ,        | $O(R_1)$                        |
|-------------------------|---------------------------------|
| F <sub>4</sub> ,        | F <sub>0</sub> , F <sub>2</sub> |
| F <sub>4</sub> ,        | $O(R_1)$                        |
| <b>R</b> <sub>1</sub> , | R <sub>1</sub> , #8             |
| $\overline{R}_1$ ,      | Loop                            |

Load vector element into F

- Add Scalar ( $F_2$ )
- Store the vector element
- Decrement by 8 (size of a double
- Branch if not zero



Instruction cycles for a super scalar machine Assume a super scalar machine that issues instructions per cycle, one integer (Load, St branch, or integer), and one floating point: ID WB EX MEM IF IF ID EX MEM WB IF ID EX MEM WB ID EX MEM WB IF IF ID EX MEM WB IF ID EX MEM WB



Clo

We will unroll the loop to allow simultane execution of floating point and integer operation

| Integer   | r Inst.                                                     | Fl. P | oint Inst.                                       |
|-----------|-------------------------------------------------------------|-------|--------------------------------------------------|
| LD        | <b>F</b> <sub>0</sub> , <b>0</b> ( <b>R</b> <sub>1</sub> )  |       |                                                  |
| LD        | <b>F</b> <sub>6</sub> , <b>-8</b> ( <b>R</b> <sub>1</sub> ) |       |                                                  |
| LD        | F <sub>10</sub> , -16(R <sub>1</sub> )                      | AD    | $F_4, F_0, F_2$                                  |
| LD        | $F_{14}$ , -24( $R_1$ )                                     | AD    | F <sub>8</sub> , F <sub>6</sub> , F <sub>2</sub> |
| LD        | $F_{18}$ , -32( $R_1$ )                                     | AD    | $F_{12}, F_{10}, F_2$                            |
| SD        | F <sub>4</sub> , 0(R <sub>1</sub> )                         | AD    | $F_{16}, F_{14}, F_2$                            |
| Fall 2010 |                                                             |       |                                                  |



Clock

**Integer Inst.** 

- $SD \qquad F_8, -8(\overline{R}_1)$
- SD  $F_{12}$ , -16( $R_1$ )
- **SD**  $F_{16}$ , -24( $R_1$ )
- **SD**  $F_{20}, -32(R_1)$
- SUB R<sub>1</sub>, R<sub>1</sub>, #40
- BNZ R<sub>1</sub>, Loop

Fl. Point Inst.

**AD**  $F_{20}, F_{18}, F_2$ 



 Increasing parallelism Cross block boundar
 Software pipelining is a compiler technique moves instructions across branches to incr parallelism — Moving instructions from iteration to another.



#### **Summary**

#### Instruction Level Parallelism

- Dynamic approach
- Static approach

► How to improve ILP within a basic block

- Compiler role
- Register renaming

➢ How to improve ILP cross block boundaries

- Static approach
- Dynamic approach



- Increasing parallelism Cross block boundaries
  - Trace scheduling is also a compiler schedu technique.
  - It uses a profile to find a trace (sequence of bl that are executed often) and schedules instructions of these blocks as a whole — Predic of branch statically based on the profile (to with failure, code is inserted outside the sequ to correct the potential error).



#### Branch Prediction

Simplest way to have dynamic branch prediction is via the so called prediction by or branch history table — A table whose enable indexed by lower portion of the table address.



#### Branch Prediction

- Entries in the branch history table can interpreted as:
  - 1-bit prediction scheme: Each entry says wheth not in previous attempt branch was taken or not.
  - 2-bit Prediction scheme: Each entry is 2-bit and a prediction must miss twice before changed — see the following diagram.





#### Branch Prediction

n-bit Saturation counter: An entry has corresponding history feature. A taken braincrements the counter and untaken braincrement the counter. A branch is not taken is counter is below 2<sup>(n-1)</sup>.



#### Branch Prediction

#### Multilevel prediction — Correlating predict

- A technique that uses behavior of other branch make a prediction on a branch.
- The first level is a table that shows the history of branch. This may be the history (pattern) of the *k* branches encountered (global behavior) or the *k* occurrences of the same branch.
- The second level shows the branch behavior for pattern



#### Question

➤With respect to our earlier definition of C time, discuss how the performance can improved?

$$T = I_{c} * CPI * \tau = \sum_{i=1}^{n} (CPI_{i} * I_{i})$$



#### Beyond RISC

# ► Is it possible to achieve a performation beyond what is being offered by *RISC*

#### Beyond RISC



#### Beyond RISC

- Machine with higher clock rate and de pipelines have been called super pipelined.
- Machines that allow to issue multi instructions (say 2-3) on every clock cycles called super scalar.
- Machines that pack several operations (say into a long instruction word are called V long-Instruction-Word machines.

#### Beyond RISC



A super scalar processor reduces the average nur of clock cycles per instruction beyond what is pose in a pipeline scalar *RISC* processor. This is achies by allowing concurrent execution of instructions in

- The same pipeline stages, as well as
- Different pipeline stages

Multiple concurrent operations on scalar quantities.

#### Beyond RISC

Instruction Timing in a super scalar processor



# Beyond RISC Fundamental Limitations

- Data Dependency
- Control Dependency
- Resource Dependency





Data Dependency

Within the scope of data dependency we can talk at

- Read after write (flow) dependency
- Write after read (anti) dependency
- Write after write (output) dependency

The literature has referred to read after write as dependency, and write after read or write after write false dependency.

Practically, write after read and write a write are due to storage conflict originated from the fact that in traditional systems we are dealing with memory organization that is globally sha by instructions in the program.

Storage medium holds different values different computations.



The processor can remove storage cont by providing additional registers reestablish one-to-one corresponde between storage (register) and values register renaming.

Two constraints are imposed by con dependencies:

➤An instruction that is control dependent of branch cannot be moved before the branch,

An instruction that is not control dependen a branch cannot be moved after the branch.



- When instructions are issued in-order and complete order, there is one-to-one correspondence between sto locations (registers) and values.
- When instructions are issued out-of-order and compout-of-order, the correspondence between register value breaks down. This is even more severe w compiler optimizer does register allocation tries to as few registers as possible.

As noted before, achieving a higher performs means processing a given task in a smaller among of time. To reduce the time to execute a seque of instructions, one can:

Reduce individual instruction latencies, or

Execute more instructions concurrently.

Superscalar processors exploit the sec alternative.

#### General Configuration

Branch outcome/Jump address





#### General Configuration

- Instruction fetch unit acts as a producer, w fetches, decodes, and places decoded instruct into the buffer.
- Instruction execution engine is the consu which removes instructions from buffer executes them, subject to data dependence resource constraints.
- Control dependences provides a feed mechanism between the producer and consume



General Configuration

Systems having this organization employed aggressive techniques to exploit instruction level parallelism.



General Configuration

 Wide dispatch and issue paths.
 Fetch, decode, and issue several instructions
 Large issue buffer,

 Largeipool of physical degisters, dence
 Largeonumber of parallel functional units,
 Speculation of pastemultiple branches.
 Are some techniques that allow aggressis exploitation of Instruction Level Parallel



#### Flow of Operations

- A typical superscalar processor fetches and dec several incoming instructions at a time.
- The outcomes of conditional branch instructions usually predicted in advance to ensure an uninterru stream of instructions
- The incoming instructions are then analyzed for and structural dependencies, and then independencies, and then independencies instructions are distributed to functional units execution.



#### Flow of Operations

- Simultaneously fetching several instructions, o predicting the outcomes of, and fetching beyo conditional branch instructions,
- > Exploit dynamic parallelisms in the program:
  - Determine true dependencies involving regivalues and communicating these values to target instructions during the course execution,
  - Detect and remove false dependencies,
- Initiate or issue multiple instructions in parallel,



#### Flow of Operations

- Manage resources for parallel execution instructions, including:
  - Multiple pipeline functional units,
  - Memory hierarchy

Committing the process state in correct orde



#### Flow of Operations

➤ The key issue to the success of superso systems is the dynamic scheduling of instructions in the program.



#### Historical Perspective

- ➤ The development of architectures to explicit instruction level parallelism in the form pipelining can be traced back to the design CDC6600 and IBM 360/91.
- Within the scope of these systems, prac showed a pipeline initiation rate at instruction per cycle.



Summary
 Out-of-Order Issue, Out-of-Order Completie
 Super Scalar processor
 Dynamic exploitation of ILP
 General Configuration of Super Scalar
 Flow of Operations in a Super Scalar



Processing Flow

An application is represented in a high language program,

This high level program is then compiled into static machine level program — The static prog describes a set of executions and its imp sequencing model (the order in which instruct are executed).



Program Representation — High Level Const.

For 0 = i < lastIf a(i) > a(i+1)temp = a(i)a(i) = a(i+1)a(i+1) = temp

End

Fa

#### Program Representation — Assembly code

| L2:      | Move<br>LW<br>Add<br>LW<br>Ble | $r_3, r_7$<br>$r_8, (r_3)$<br>$r_3, r_3, 4$<br>$r_9, (r_3)$<br>$r_8, r_9, L3$ | $r_7$ points to an element of the $r_8$ holds the i <sup>th</sup> element of the advancing the index $r_9$ holds the i+1 <sup>th</sup> element of |
|----------|--------------------------------|-------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------|
|          | Move<br>SW<br>Add<br>SW<br>Add | $r_3, r_7$<br>$r_9, (r_3)$<br>$r_3, r_3, 4$<br>$r_8, (r_3)$<br>$r_5, r_5, 1$  | In this block i <sup>th</sup> and i+1 <sup>th</sup> ele:<br>are swapped                                                                           |
| L3:      | Add<br>Add                     | r <sub>6</sub> , r <sub>6</sub> , 1<br>r <sub>7</sub> , r <sub>7</sub> , 4    | r <sub>6</sub> holds the index                                                                                                                    |
| all 2010 | Blt                            | r <sub>6</sub> , r <sub>4</sub> , L2                                          | r <sub>4</sub> holds the "last"                                                                                                                   |



#### Processing Flow

- During the course of execution, the sequence executed instructions forms a dynamic instrucstream.
- As long as instructions to be executed sequential, static instruction sequencing car entered into the dynamic instruction sequencin incrementing the program counter.



Processing Flow

- However, in the presence of conditi branches and jumps the program counter is be updated to a nonconsecutive address control dependence.
- The first step in increasing instruction 1 parallelism is to overcome con dependencies.



Control Dependencies — Straight line code

- Let us talk about control dependencies du the incrementing the program counter:
  - The static program can be viewed as a collection basic blocks, each with a single entry point a single exit point, refer to our example, we have basic blocks.



Control Dependencies — Straight line code

- Once a basic block is entered, its instructions fetched and execute to completion, there sequence of instructions in a basic block ca initiated into a conceptual window of execution.
- Once the instructions are initiated, they are free execute in parallel, subject only to the dependence constraints and availability of hardware resources.



- Control Dependencies Conditional Branch
  - To achieve a higher degree of parallelism, a s scalar processor should address updates of program counter due to the conditional branche
  - A method is to predict the outcome of a conditi branch and speculatively fetch and exe instructions from the predicted path.
  - Instructions from predicted path are entered the window of execution.



- Control Dependencies Conditional Branch
  - If prediction is later found to be correct, ther speculation status of the instructions are remain and their effect on the state of the system is same as any other instructions.
  - ➢ If prediction is later found to be incorrect, speculative execution was incorrect and reco actions must be taken to undo the effec incorrect actions.

Processing Flow



➢ In our running example, the *ble* instruction of the output of the output of the structure of the output of

To overcome this dependence, the branch constructed as not taken and he instructions between the branch and label being executed speculatively.

| Move | r <sub>3</sub> , r <sub>7</sub>     |
|------|-------------------------------------|
| SW   | $r_{9}, (r_{3})$                    |
| Add  | r <sub>3</sub> , r <sub>3</sub> , 4 |
| SW   | $r_{8}, (r_{3})$                    |
| Add  | r <sub>5</sub> , r <sub>5</sub> , 1 |



#### Data Dependencies

Instructions placed in the window of execution may begin execution subject to data depend constraints.

Note that data dependence comes in the form of

- Read After Write (RAW),
- Write After Read (WAR), and
- Write After Write (WAW).



#### Data Dependencies

- ➢ Note that, among the three aforementioned dependence, RAW is the true dependence the other two are false (artificial) dependence.
- ➤In the process of execution, the f dependencies have to be overcome to incr degree of parallelism.





Processing Flow

- After resolving control and artif dependencies, instructions are issued and b execution in parallel.
- The hardware form a parallel execusive schedule.
- The execution schedule takes constraints as true data dependence and hardware resore constraints into account.



#### Processing Flow

- A parallel execution schedule means instructions complete in an order different instructions order dictated by the seque execution model.
- Speculative execution means that s instructions may complete execution bey the scope of the sequential execution model.



#### Processing Flow

- Speculative execution implies that the execution results cannot be recorded permanently right awa
- As a result, results of an instruction must be held temporary status until the architectural state ca updated.
- Eventually, when it is determined that the seque model would have executed an instruction temporary results are made permanent by update the architectural state — Instruction is committed retired.

### Super Scalar Architecture





#### Instruction Fetch and Branch Prediction

- For a super scalar implementation, the fetch p must be able to fetch multiple instructions cycle. To achieve this, it has been found usef separate the instruction cache from the data cac
- The number of instructions fetched per c should at least match the peak instruction de and execution rate.
- Instruction buffer is used to smooth the instruction fetch irregularities due to cache misses branches.



#### Instruction Fetch and Branch Prediction

- The default instruction fetching method is increment the program counter by the number fetched instructions.
- The handling of branch (specially the conditional branch) is critical to good performance of a sealar processor. Processing conditional branch branch.
  - Recognizing the branch,
  - Determining the branch outcome,
  - Computing the branch target, and
  - Transferring control.



Instruction Fetch and Branch Prediction

Recognizing the branch

- In general, recognizing the instruction type advanced, can speed up the instruction flow to execution buffer.
- This can be achieved by pre-decoding instruction prior to its residence in the cach Each instruction in the cache is extended by bits.







#### Instruction Fetch and Branch Prediction

- Determining the branch outcome
  - Several techniques can be used to predict outcome of a branch.
  - Some predictors use static information, others dynamic information based on the branch histor
  - Note that, if prediction was incorrect, instrufetching must be redirected to the correct pat addition, if instructions were executed speculati they must be purged and their results mus nullified.



Instruction Fetch and Branch Prediction

Computing the branch target

- Early calculation of the target branch c improve the performance.
- This can be sped up by having a branch ta buffer that holds the target address that was the last time the branch was executed.
- As an example PowerPC 64 uses the Brand Target Address Cache.



### Instruction Fetch and Branch Prediction

#### Transferring Control

- In case of a taken branch, there is at least one of cycle delay to recognize the branch, calculat program counter, and fetching instruction from target address.
- Several techniques can be used to mask out delay:
  - Use instructions in the instruction buffer,
  - Fill out instruction buffer by taken and not taken path
  - Use of delayed branch.



Instruction Fetch and Branch Prediction

Instruction Decoding, Renaming, and Dispatch

- At this stage, instructions from fetch buffer removed, examined, control and data depend relationships are set up, and dispatched to instruction buffers associated with the funct units.
- Often to improve the degree of parallelism, this tries to remove the false dependence by renar the physical storage locations as defined in instructions.

### Instruction Fetch and Branch Prediction





Register Renaming

- $ightarrow 
  m R_3 \Leftarrow 
  m R_3$  op  $m R_5$
- > R<sub>4</sub>  $\Leftarrow$  R<sub>3</sub> + 1
- $ightarrow 
  m R_3 = 
  m R_5 + 1$
- > R<sub>7</sub>  $\Leftarrow$  R<sub>3</sub> op R<sub>4</sub>

 $R_{5} \qquad R_{3b} \Leftarrow R_{3a} \text{ op } R_{5a}$   $R_{4b} \Leftarrow R_{3b} + 1$   $R_{3c} \Leftarrow R_{5a} + 1$   $R_{4} \qquad R_{7b} \Leftarrow R_{3c} \text{ op } R_{4b}$ 

Each assignment to a register creates a instance of the register.



Instruction Fetch and Branch Prediction

### Register Renaming

- There are two register renaming techniqu
  - -The physical register file is larger logical register file,
  - -The physical register file is the same size the logical register file.



#### Instruction Fetch and Branch Prediction

- Register Renaming The physical register file is large logical register
  - A mapping table is used to associate a physical register the current value of a logical register.
  - For each logical destination register a physical register, the list of free registers, is extracted and association be the two is recorded in the mapping table.
  - As part of rename operation, for each source logical reg the mapping table is investigated to find its associated ph register.





#### Instruction Fetch and Branch Prediction

- Register Renaming The physical register file is large logical register file
  - After a physical register has been read for the last time, be returned to the free list to be reused.
  - A counter can be associated to each physical register,
    - It will be incremented whenever it is renamed as a source,
    - It will be decremented each time an instruction issue actually reads a value from the register.
    - A physical register is returned back to free space, if its cou zero and corresponding logical register is renamed.



### Instruction Fetch and Branch Prediction

- Register Renaming The physical register f the same as logical register file
  - This model uses a so called reorder buffer maintains proper instruction ordering (instruc that are dispatched but not yet completed) precise interrupts.
  - Reorder buffer is organized as a circular queue.



### Instruction Fetch and Branch Prediction

- Register Renaming The physical register for the same as logical register file
  - As instructions are dispatched based on seque ordering of the program, they are entered into reorder FIFO buffer.
  - As instructions complete execution, their revalues are inserted into the previously assi entry, wherever it may happen to be in the real buffer.



### Instruction Fetch and Branch Prediction

- Register Renaming The physical register f the same as logical register file
  - When an instruction reaches the head of the built if it is completed, it is removed from the buffer its result is stored into the register file.
  - An incomplete instruction at the head of the b blocks the buffer until it completes.





Instruction Issue and Execution

- Instruction issue is defined as the runchecking for availability of data and resource
- After decode/rename/dispatch phase, the step is to determine which instruction types be issued for execution.
- Three topology has been discussed:



### Instruction Issue and Execution

#### Single Queue Model

- If there is a single queue and no out-of-order exec then renaming is not necessary and operand availa can be managed via a simple reservation bits assign each register.
- A register is reserved when an instruction modifyin issued.
- A register is cleared when the instruction completes.
- An instruction is issued if there is no functional co and no reservation on its operand.



### Instruction Issue and Execution

#### Multiple Queue Model

- In this case, instructions are issued from each q in order, but the queues may issue out of order respect to one another.
- The instruction queues are organized according the instruction types.



### Instruction Issue and Execution

#### Reservation Station

- In this case, instructions are issued out-of-order of the reservation stations simultaneously mo their source operand for availability of data.
- When an instruction is dispatched to the reservent station, any already available operand values are from the register file into the reservation station



### Instruction Issue and Execution

- Reservation Station
  - Reservation station compares the open designators of unavailable data with the reduction designators of completing instructions. When is a match, the result value is pulled into matching reservation station — sort of forwarding
  - When all the operands are available in reservation station, the instruction may be issued



### Committing State

- The final phase of an instruction is commit or a state.
- In the commit state, the effects of the instruction allowed to modify the logical state of the process.
- The purpose of this phase is to implement appearance of a sequential execution model though the actual execution is not sequential due to speculative execution and out-of-order execution.



### Committing State

- In one approach, the state of the machin certain points are saved (check pointed) ei in a history buffer or a checkpoint.
- Instructions update the state of the machin they execute and when a precise state is nee it is recovered from the history buffer.



### Committing State

- In another approach, the state of the machin separated into the:
  - Physical state, and
  - Logical (architectural) state.

Committing State

The physical state is updated as the oper completes.

The logical state is updated in sequential proportion order, as the speculative state of the operation cleared. The speculative state is maintained in a red buffer.

To commit an instruction, its result has to be rem from the reorder buffer into the architectural reg file.

### MIPS R1000 — General Configuration





### MIPS R1000

- Performs extensive dynamic scheduling,
- Fetches four pre-decoded instructions at a from an instruction cache of 512 lines,
- Pre-decoding extends each binary instruction by 4 bits,
- Physical register file (64) is twice the log register file (32),



#### MIPS R1000

- Supports a Branch prediction table of 512 entries a contained within the instruction cache mechar Each entry holds 2-bit counter to show the hi information,
- For predicted taken branch, one cycle is needer redirect instruction fetching. During this c sequential instructions are fetched and placed resume cache – a space of four instructions blocks,



#### MIPS R1000

➢ In case of a predicted branch, a snapshot of register mapping table is taken — up to snapshots can be stored at the same time,



### MIPS R1000

Dispatches up to four instructions into t instruction queues:

- Memory,
- Integer, and
- Floating point

Instruction queues are 16 entries deep and a full queue can block dispatch unit,



### MIPS R1000

Supports five functional units:

- An address adder,
- Two integer ALUs,
- A floating point multiplier/divider/square rooter
- A floating point adder



### MIPS R1000

- Supports an on-chip 2-way set associative Kb) primary cache and an off-chip second cache,
- Uses re-order buffer to maintain a precise s at the time of exception,
- Commits instructions in the original prog sequence, up to four at a time.

### DEC Alpha 21164 — General Configuration





#### DEC Alpha 21164

- Compromises dynamic scheduling in favor of a his clock rate,
- Fetches four instructions at a time from an 8 Kl instruction cache,
- Has two instruction buffers, each capable of hol four instructions,
- Issues instructions from instruction buffer in program order. An instruction buffer must be emp before the other being used,



#### DEC Alpha 21164

- Instruction cache is enhanced by a bra history table, each entry is a 2-bit counter,
- At most one predicted and yet unreso branch can exist at a time,
- ➢ Following instruction fetch and dec instructions are inspected and upon availab of operands they are issued — during process instructions are not allowed to pass another,



DEC Alpha 21164

Supports four functional units:

- Two integer ALUs,
- A floating point adder, and
- A floating point multiplier,

Supports two level of on-chip caches:

- Primary cache dedicated each 8 Kbytes,
- Secondary cache unified 96 Kbytes.



# Limitations

- Limit on instruction level parallelism,
- Complexity of superscalar.

VIIW

Hardware complexity and scalability two major issues that question the valion of the Super Scalar machines.

Consider the following cases:

The instruction issue logic of PA-8000 four issue Super Scalar machine with instruction queue entries) occupies 20% the die area.

VLW

As the issue width increases the need register renaming, complexity of bypass forwarding, and interlocks increaded and interlocks increaded and the second s

#### Very Long Instruction Word — VII

Increasing hardware complexity pays b lesser dividends:

Performance of a 200 MHZ MIPS R500 single issue machine) on SPEC95 is about ' of a 200MHZ MIPS R10000 (a four i machine).

Commercial VLIW machines introduced in early 80's.

Multiflow delivered:

- Trace/200 256-bit 7 wide issue machine,
- Trace/300 256-bit 7 wide issue machine, and
- Trace/500 512, 1024-bit 14, 28 wide machine.

Cydrome delivered:

• Cydra 5 – 256-bit 6 wide issue machine.



Present commercial VLIW machines:
 IA-64 is a joint venture between Intel and H
 Philips Trimedia processor, a DSP chip multimedia applications,
 CRUSOE by Transmeta.



### Very Long Instruction Word — VIIW

Very Long Instruction Word (VLIW) destakes advantage of instruction parallelism reduce number of instructions by pack several independent instructions into a long instruction.

The principle behind VLIW is similar that of parallel computing — exec multiple operations in one clock cycle.

VLIW arranges all executable operations in word simultaneously — many static scheduled, tightly coupled, fine-grained operat execute in parallel within a single instrucstream.

Naturally, the more densely the operations can compacted, the better the performance (lo number of long instructions).



During compaction, NOOPs can be used operations that can not be used.

To compact instructions, software must be to detect independent operations.

A VLIW instruction might include two int operations, two floating point operations, memory reference operations, and a bra operation.

VLIV

The compacting compiler takes ordinates ordinates and compresses it into very instruction words through unrolling loops trace scheduling scheme.

#### Block Diagram



<u>MAN</u>

### Very Long Instruction Word — VLIW

- This organization is very similar to heterogeneous multiprocessor system, however
  - Each long instruction contains op. code to control individual processors,
  - Instructions are in a single flow of control instruction is fetched, all the processors do individual operations, then the next instructio fetched — Only one locus of control,
  - Instruction word completely controls all communication among the processors.

VLIW fits somewhere between SIMD and M in Flynn's taxonomy.

VEIW

- VLIW consists of multiple functional unit single control unit, and a single monolithic reg file.
- The processor fetches from the instruction can very long instruction containing a set of prim instructions, and dispatches them simultaneo for parallel execution.



This architecture allows any functional to access any registered data. This imp too many ports to fully connect the regi file to all functional units.

VLW

More realistic solution, partitions regi file into banks and allows a subset functional units to have exclusive acces each register bank.

# Very Long Instruction Word — VIII

Assume the following FORTRAN code its machine code:

> C = (A \* 2 + B \* 3) \* 2 \* i,Q = (C + A + B) - 4 \* (i + j)



Machine code:

1) LD A  
3) 
$$t_1 = A * 2$$
  
5)  $t_3 = t_1 + t_2$   
7)  $t_4 = 2 * I$   
9) ST C  
11)  $t_5 = I + J$ 

13) 
$$t_7 = A + B$$

18

6

2) LD B 4)  $t_2 = B * 3$ 6) LD I 8)  $C = t_4 * t_3$ 10) LD J 12)  $t_6 = 4 * t_5$ 14)  $t_8 = C + t_7$ 16) ST Q

Fall 2010

IJ)

# Very Long Instruction Word — VERY



| LD0  | LD1  | INT0      | INT1  | FP0                             | FP1       | BR |
|------|------|-----------|-------|---------------------------------|-----------|----|
| LD A | LD B |           |       |                                 |           |    |
| LD I | LD J |           |       | A * 2                           | B * 3     |    |
|      |      | 2 * I     | I + J | $t_1 + t_2$                     | A + B     |    |
|      |      | $4 * t_5$ |       | t <sub>4</sub> - t <sub>3</sub> |           |    |
| ST C |      |           |       |                                 | $C + t_7$ |    |
|      |      |           |       | $t_8 - t_6$                     |           |    |
| ST Q |      |           |       |                                 |           |    |

VEW



Summary

VLIW — General Philosophy

VLIW — General Configuration

VLIW — Advantages and Disadvantages

►VLIW — An example

Assume a VLIW machine capable issuing two floating point operations, memory reference operations and integer/branch operation.

Further assume the following loop:

#### LOOP:

ADD  $F_4$ ,  $F_0$ ,  $F_2$  Add Scalar ( $F_2$ ) BNZ  $R_1$ , Loop

LD  $F_0$ ,  $O(R_1)$  Load vector element into  $F_0$ SD  $F_4$ ,  $O(R_1)$  Store the vector element SUB  $R_1, R_1, \#8$  Decrement by 8 (size of a double Branch if not zero

We will unroll the loop seven times to optimal performance:



|   | Memory<br>Reference 1      | Memory<br>Reference 2                   | Fl. Point 1                                                                | Fl. Point 2                                           | Inte |
|---|----------------------------|-----------------------------------------|----------------------------------------------------------------------------|-------------------------------------------------------|------|
| 1 | LD $F_0, 0(R_1)$           | LD F <sub>6</sub> , -8(R <sub>1</sub> ) |                                                                            |                                                       |      |
| 2 | LD $F_{10}$ , -16( $R_1$ ) | LD $F_{14}$ , -24( $R_1$ )              |                                                                            |                                                       |      |
| 3 | LD $F_{18}$ , -32( $R_1$ ) | LD $F_{22}$ , -40( $R_1$ )              | AD $F_4$ , $F_0$ , $F_2$                                                   | AD $F_8, F_6, F_2$                                    |      |
| 4 | LD $F_{26}$ , -48( $R_1$ ) |                                         | <b>AD</b> $F_{12}, F_{10}, F_2$                                            | AD F <sub>16</sub> , F <sub>14</sub> , F <sub>2</sub> |      |
| 5 |                            |                                         | <b>AD F</b> <sub>20</sub> , <b>F</b> <sub>18</sub> , <b>F</b> <sub>2</sub> | AD $F_{24}, F_{22}, F_2$                              |      |
| 6 | <b>SD</b> $F_4$ , $0(R_1)$ | <b>SD</b> $F_8$ , -8( $R_1$ )           | AD F <sub>28</sub> , F <sub>26</sub> , F <sub>2</sub>                      |                                                       |      |
| 7 | SD $F_{12}$ , -16( $R_1$ ) | SD $F_{16}$ , -24( $R_1$ )              |                                                                            |                                                       |      |
| 8 | SD $F_{20}$ , -32( $R_1$ ) | SD $F_{24}$ , -40( $R_1$ )              |                                                                            |                                                       | SUI  |
| 9 | SD $F_{28}$ , -48( $R_1$ ) |                                         |                                                                            |                                                       | BN   |

V B

#### VLIW Compiler Technology

- Effectiveness of VLIW architecture is he dependent on the exploitation of parallelisms in application program by the compiler.
- > The compiler must be efficient and clever.
- VLIW compilers heavily use:
  - Speculative scheduling when branch is encountered,
  - Loop unrolling to expose more instruction level parallelis
  - Software pipelining,
  - Inlining to reduce the overhead associated with proc calls, and
  - Trace Scheduling.

- VLIW machine can actually run two pos paths of a branch.
- When the branch is computed at run time and correct path is known, the incorrect branc discarded and execution continues as if there no branch.
- Consequently, the branch penalty can be all zero compared to a scalar processor.

#### Trace Scheduling

- This technique is applied once all intermed optimizations are complete and the code machine level instructions.
- Trace is a linear section of code that mus executed together.
- Trace scheduling is composed on two parts:
  - Selection, and
  - Compaction



#### Trace Scheduling

- Selection modules, recursively, selects the t with highest probability of execution and se it to the compaction module.
- Trace compaction uses several technique rearrange instructions within a trace minimize wasted instructions in a long v and consequently, to reduce the total number long words.

#### Trace Scheduling

➢ If an instruction in the trace is moved for before a conditional jump to after the jum copy of it must be placed in the off-trace of the jump.

| 1: O <sub>1</sub> |                                 | 2: if cond. <b>–</b> | <b>→</b> 1': |
|-------------------|---------------------------------|----------------------|--------------|
| 2: if cond. —     | $\rightarrow$ 4: O <sub>3</sub> | 1': O <sub>1</sub>   | 4:           |
| 3: 0,             | 5° 0,                           | 3. 0.                | 5.           |

#### Trace Scheduling

If a trace operation is moved above a rejoin trace, then a copy must be placed on the trace rejoin edge.





Block X is said to post dominate block every path through Y must ultimately through X.



4': 
$$O_4$$
  
1: if cond.  $\longrightarrow$  5: 0  
2:  $O_2$   
3:  $O_3$ 







#### ► IA-64 — Register Configuration

General Purpose Registers



First 32 registers are used Statically, and rest will be used as Stacked/rotating re Stacked/rotating registers are used to s Procedure calls.

### ►IA-64 — Register Configuration

**Floating Point Registers** 



Register rotating is used to ease the Task of allocating registers in softw Pipelined loops.

### ►IA-64 — Register Configuration

Special Application Registers



Special purpose application registers are used to support features such as Register stack, Software pipelining,.



#### ►IA-64 — Register Configuration

**Branch Registers** 



Branch registers are used For indirect branches.

►IA-64 — Register Configuration



VIII

Each bit represents the result of a conditional expression evaluation.



#### ►IA-64 — Instruction Format

| Op-code | Reg. 1 | Reg. 2 | Reg. 3 | Predicate |
|---------|--------|--------|--------|-----------|
| ← →     | · >    | ←───   | ←───   | ← →       |
| 14 bits | 7 bits | 7 bits | 7 bits | 6 bits    |

►IA-64 — Instruction Format (instruction bund

**VEEN** 



Template field specifies what type of Execution Units each instruction in a group requires.



#### Compilation techniques to exploit ILP

- Loop Unrolling
- Trace Scheduling
- Software Pipelining
- Speculative scheduling

► IA-64

- General Configuration
- Register sets
- Instruction Format

#### ►IA-64

#### >Two levels of parallelisms are provided:

- Instruction Level Parallelism Compiler cr instruction groups (a collection of instru bundles) so that all instructions in an instru group can be executed in parallel safely.
- Control Flow Parallelism is provided executing compound And and Or condition parallel. This allows several multiway branch be grouped together and executed in a sinstruction group.

Very Long Instruction Word — VIIII
IA-64 — Compound Conditional Code
Assume the following code:

If  $((a = 0) || (b \le 5) || (c \ne d) || (f > 10))$ r<sub>3</sub> = 8;

IA-64 — Compound Conditional Code
 In IA-64 the aforementioned code is express

Cmp.ne  $p_1 = r_0, r_0$ Add t = -5, bAdd k = -10, f

Cmp.eq.or  $p_1 = 0$ , a Cmp.ge.or  $p_1 = 0$ , t Cmp.ne.or  $p_1 = c$ , d Cmp.gt.or  $p_1 = 0$ , k

 $(p_1) \mod r_3 = 8$ 

IA-64 – Compound Conditional Code
 Register p<sub>1</sub> is initialized to false,
 The conditions for each of the OR express is calculated in parallel, and
 Final result of the p<sub>1</sub> is used in the instruction.



- ►IA-64 reduces the negative effect of branches.
- ➤It allows the compiler to generate the cod execute instructions from multiple conditi paths at the same time.

VEW

IA-64 — Branches
 Assume the following code:

If  $(r_1 = r_2)$   $r_9 = r_{10} - r_{11};$ else  $r_5 = r_6 + r_7;$ 



#### ►IA-64 — Branches

Ability to calculate compound condition codes in parallel and associating a predito each statement allows compiler to b larger basic blocks and hence to increase degree of instruction level parallelism.





#### ►IA-64 — Branches

Trace scheduling then will be appreciated approximation of the extensively to schedule traces with high probability of execution earlier.



#### Speculative load

- Speculative load is an interesting technique allows to speculatively early start execution of critical instructions (load).
- Load can be safely scheduled ahead of one or 1 branches.
- In case of exception, flag is raised and attached to load result — At runtime, a deferred exception tok written to the target register (extra bit of register).
- At proper moment, this flag is checked to redirec control to a fix-up code.



#### Speculative load

- ➢Note that almost all instructions in IA propagate the tag on a register, as a re entire calculation chains may be sched speculatively.
- The compiler only inserts a single *chk.s* (cl speculate) to check the result of mult speculative computations.

#### Data speculation

- ►IA-64 allows the compiler to schedule a before one or more prior stores speculation.
- Advanced load instruction (*ld.a*) along advanced load check instruction (*chk.a*) used to accomplish this.





Data speculation

An advanced load is a load that has a speculatively moved above store instruction which it is potentially dependent.



#### Data speculation

- Advanced load is similar to traditional 1 During the run time, system rec information such as:
  - The target register,
  - Memory address accessed, and
  - Access sized

In the advanced load address table.

Data speculation

➤When a store is executed, an associative up against the active advanced load add table is performed. If there is a match, advanced load address entry is marked invalid (it is cleared).

#### Data speculation

- Later, when the chk.a is executed, hardy checks the advanced load address table for entry installed by its corresponding advantational.
  - If an entry is found, the speculation was succe and nothing will happen.
  - If no entry is found, there may have been a colli and the check instruction branches to a fix-up to reexecute the code.

#### Questions

- Compare and contrast VLIW archited against multiprocessor and vector proce (you need to discuss about issues such as flow of control, inter-proce communications, memory organization programming requirements).
- Within the scope of VLIW architecture, dis the major source of problems.

- VLIW machines are not software compatible with general purpose machine. Even they are compatible with themselves.
- Density of long instructions depend on the instruction level parallelism detected during the compila This could effect space utilization drastically.
- VLIW offers performance improvement.
- There is no need for extra hardware in order to deparallelism.



Naturally, in a super Pipelined Process sub-stages are clocked at a higher freque than the major stages.

Reducing processor cycle time, he higher performance, relies on instruct parallelism to prevent pipeline stalls in sub-stages.

In comparison with Super Scalar:

- ➢ For a given set of operations, the s pipelined processor takes longer to generate results than the super scalar processor.
- Simple operations take longer time to exe in a super scalar than super pipelined, s there are no clock with finer resolution.

From hardware point of view, super so processors are more susceptible to reso conflicts than super pipelined processor. A result hardware should be duplicated for so scalar processor. On the other hand, in so pipelined processor, we need latches betwee pipeline sub-stages. This adds overhead computation — degree of super pipeline could add severe overhead.

### Super Pipelined Superscalar Process

Since the number of instructions issued cycle and the cycle time are theoretic orthogonal, we could have a super pipeli superscalar machine.



2-Stage 2-issue Super Pipelined Superscalar Prod



As noted before, achieving a higher performance means processing a given task in a smaller among of time. To reduce the time to execute a sequence of instructions, one can:
 Reduce individual instruction latencies, or
 Execute more instructions concurrently.

Superscalar processors exploit the sec alternative.



Faster Clock Rate





### Software pipelining





### Software pipelining

#### Software pipelining requires

- Managing the loop count,
- Handling renaming the registers for the pipeline
- Finishing the work in progress when the loc ended.
- Starting the pipeline when the loop is entered, an
- Unrolling to expose cross-iteration parallelism.



#### Software pipelining

Software pipelining is a technique reorganizes loops such that each iteration in software-pipelined code is made fi instructions chosen from different iteration the original loop — it interleaves instruct from different iterations without unrolling loop.



## Software Pipelining LOOP:

- ADD  $F_4$ ,  $F_0$ ,  $F_2$  Add Scalar ( $F_2$ ) BNZ  $R_1$ , Loop Branch if not zero
- LD  $F_0$ ,  $O(R_1)$  Load vector element into  $F_0$ SD  $F_4$ ,  $O(R_1)$  Store the vector element SUB  $R_1, R_1, \#8$  Decrement by 8 (size of a double



| Software Pipelining |     |                 |  |  |
|---------------------|-----|-----------------|--|--|
| Iteration i:        | LD  | $F_0, 0(R_1)$   |  |  |
|                     | ADD | $F_4, F_0, F_2$ |  |  |
|                     | SD  | $F_4, 0(R_1)$   |  |  |
| Iteration i+1       | LD  | $F_0, 0(R_1)$   |  |  |
|                     | ADD | $F_4, F_0, F_2$ |  |  |
|                     | SD  | $F_4, 0(R_1)$   |  |  |
| Iteration i+2       | LD  | $F_0, 0(R_1)$   |  |  |
|                     | ADD | $F_4, F_0, F_2$ |  |  |
|                     | SD  | $F_4, 0(R_1)$   |  |  |
|                     |     |                 |  |  |



## Software Pipelining LOOP:

- SD  $F_4$ , 16( $R_1$ ) ADD  $F_4, F_0, F_2$ LD  $F_0$ ,  $O(R_1)$  Loads M[i-2] SUB  $R_1, R_1, \#8$  Decrement by 8 BNZ R<sub>1</sub>, Loop
  - Stores into M[i] Adds to M[i-1] Branch if not zero