## CSC 631: High-Performance Computer Architecture

Spring 2022 Lecture 8: VLIW

# **VLIW: Very Long Instruction Word**



- Each operation slot is for a fixed function
- Constant operation latencies are specified
- Architecture requires guarantee of:
  - Parallelism within an instruction => no cross-operation RAW check
  - No data use before data ready => no data interlocks

## **VLIW Compiler Responsibilities**

- Schedule operations to maximize parallel execution
- Guarantees intra-instruction parallelism
- Schedule to avoid data hazards (no interlocks)
  - Typically separates operations with explicit NOPs

#### **Loop Execution**



How many FP ops/cycle?

1 fadd / 8 cycles = 0.125



Need to handle values of N that are not multiples of unrolling factor with final cleanup loop

5

#### **Scheduling Loop Unrolled Code**



How many FLOPS/cycle?

4 fadds / 11 cycles = 0.36



7

#### Software Pipelining vs. Loop Unrolling



Software pipelining pays startup/wind-down costs only once per loop, not once per iteration

# What if there are no loops?



- Branches limit basic block size in control-flow intensive irregular code
- Difficult to find ILP in individual basic blocks

9

## Trace Scheduling [Fisher,Ellis]

- Pick string of basic blocks, a trace, that represents most frequent branch path
- Use profiling feedback or compiler heuristics to find common branch paths
- Schedule whole "trace" at once
- Add fixup code to cope with branches jumping out of trace

# **Problems with "Classic" VLIW**

- Object-code compatibility
  - have to recompile all code for every machine, even for two machines in same generation
- Object code size
  - instruction padding wastes instruction memory/cache
  - loop unrolling/software pipelining replicates code
- Scheduling variable latency memory operations
  - caches and/or memory bank conflicts impose statically unpredictable variability
- Knowing branch probabilities
  - Profiling requires an significant extra step in build process

#### Scheduling for statically unpredictable branches

- optimal schedule varies with branch path

### **VLIW Instruction Encoding**



- Schemes to reduce effect of unused fields
  - Compressed format in memory, expand on I-cache refill
    - used in Multiflow Trace
    - introduces instruction addressing challenge
  - Mark parallel groups
    - used in TMS320C6x DSPs, Intel IA-64
  - Provide a single-op VLIW instruction
    - Cydra-5 UniOp instructions

# Intel Itanium, EPIC IA-64

- EPIC is the style of architecture (cf. CISC, RISC)
  - Explicitly Parallel Instruction Computing (really just VLIW)
- IA-64 is Intel's chosen ISA (cf. x86, MIPS)
  - IA-64 = Intel Architecture 64-bit
  - An object-code-compatible VLIW
- Merced was first Itanium implementation (cf. 8086)
  - First customer shipment was expected in 1997 (actually 2001)
  - McKinley, second implementation shipped in 2002
  - Poulson, eight cores, 32nm, 2012
  - Kittson or Itanium 9700, 2017.
    - Similar to Poulson but a higher clock
- Kittson was discontinued in 2019
  - Last ship date July 2021

# **IA-64 Instruction Format**



#### 128-bit instruction bundle

- Template bits describe grouping of these instructions with others in adjacent bundles
- Each group contains instructions that can execute in parallel



### **Intel Kills Itanium**

- Donald Knuth " ... Itanium approach that was supposed to be so terrific—until it turned out that the wished-for compilers were basically impossible to write."
- "Intel officially announced the end of life and product discontinuance of the Itanium CPU family on January 30th, 2019", Wikipedia