# The Itanium® Architecture

## A Technical Overview

Thomas Siebold
Technical Consultant
Transition Engineering & Consulting
Business Critical Server Division
thomas.siebold@hp.com

Rev. 6.5

© 2004 Hewlett-Packard Development Company, L.P.
The information contained herein is subject to change without notice

## Language and cultural differences

This is a 'mobile phone' ... but in Germany it is called a 'handy' ... but in other countries a 'handy' is a ...

## Agenda

- Terminology
- Itanium® Roadmap
- The Itanium® Architecture

## **Terminology**

## Processor Architectures and Implementations

**Alpha® Processor Family**

## Working Together

## Continue Working Together

- Alpha technology/resources enhance Itanium®-based compilers/tools (SW)
- Alpha technology/resources accelerate and enhance Itanium® Architecture processors/platforms (HW)

### Intel® Itanium® Processor Family Roadmap

#### **Performance**

**Lowest Cost of Ownership**

| Segment | 2003 | 2004 | 2005 | Next Generation |
|---|---|---|---|---|
| Multi-Processor (MP) Capable, Leading Performance | Itanium<sup>®</sup> 2 Processor 1.5GHz, 6M; 1.4GHz, 4M; 1.3GHz, 3M | Itanium<sup>®</sup> 2 Processor (Madison 9M) >1.5GHz, 9M | Montecito: Dual Core, 24MB, 90nm Technology | Tukwila: Multi Core, Developed with ex-Alpha team |
| Dual Processor (DP) Capable, Leading \$/FLOP | Itanium<sup>®</sup> 2 Processor 1.4GHz, 1.5M, DP | Itanium® 2 Processor >1.4GHz, DP | Future DP | Future DP |
| Dual Processor (DP) Capable, Lower Power | LV Itanium<sup>®</sup> 2<sup>†</sup> Processor 1.0GHz, 1.5M, DP | LV Itanium® 2 Processor >1.0GHz, DP | Future DP, Low Voltage | Future DP, Low Voltage |

Long term Itanium® Roadmap Strength

## Delivering on the Architecture

#### MP/DP CAPABLE

All dates specified are
target dates, are provided for planning purposes only and are subject to change.

### Itanium

### Itanium™ Processor

### Itanium2 - McKinley / Madison

| | 2002 | 2003 | 2004 | 2005 |
|---|---|---|---|---|
| Processor | Itanium<sup>®</sup> 2 Processor (1 GHz, 3MB L3) | Itanium<sup>®</sup> 2 Processor (Madison & Deerfield) (1.5GHz, 6MB L3) | Itanium<sup>®</sup> 2 Processor (Madison 9M) (>1.5GHz, 9MB L3) | Montecito |
| Process | 0.18 µm | 0.13 µm | 0.13 µm | 90 nm |

### Madison\*\*

\*\*codename

- 3rd Generation Itanium® Architecture Processor
- 130nm Process, 410M Transistors
- 1.5GHz Frequency
- 6 GFLOPS double-precision floating-point peak
- 6MB integrated L3-Cache (48GB/s)
- Pin-Compatible to Itanium® 2 Processor
- 100% Binary Compatible
- Same Thermal Envelope
- Low-Voltage Version (Deerfield\*\*) in 2H2003
- ~1.3-1.5x faster than Itanium® 2

### Itanium® 2 Processor Block Diagram (schematic overview)

# Intel® Itanium2®-based microarchitecture block diagram

### Montecito\*\*

\*\*codename

- 5<sup>th</sup> Generation Itanium® Architecture Processor
- 90nm Process
- Dual Enhanced Core per Die
- High Frequency
- 12MB integrated L3-Cache per Core
- Multi-Threading Support
- A few new instructions
- Low-Voltage Version as well
- Target in 2005

All features and dates specified are targets provided for planning purposes only and are subject to change

### Itanium2 Processor ("McKinley")

- 221M FETs
- 421mm<sup>2</sup>
- 90+% of the transistors and 50+% of the die area are devoted to <u>cache</u> and cache support logic
- Madison: ≈ 410M FET
- Montecito: ≈ 1000M FET

## Intel Enterprise Micro-Architectures

- Xeon® Processor w/ 64-bit Extensions
- Itanium® 2 Processor

#### **Performance via Parallelism**

## Itanium is uniquely architected for performance

Itanium integrates the best of IA32 performance technology with forward-looking architectural enhancements

### x86 32-bit/64-bit Xeon processor

Optimized for cost/performance in small to medium scale applications and databases

#### **EPIC 64-bit Itanium processor architecture**

- Optimized for best throughput performance in large and complex technical and commercial
workloads
- Performance is much more than 64-bits

| x86 32b/64b Xeon | <u>Itanium</u> | <u>System features</u> |
|---|---|---|
| 24 to 40 general registers<br/>Thread-level parallelism (Hyperthreads) | 264 application registers + 64 predicate registers<br/>Instruction-level parallelism + core-level parallelism* | Efficient operation; high performance:<br/>Reduced context switching<br/>Efficient workload management<br/>Efficient clock-cycle utilization |
| Hardware-based parallelism | Data and control speculation | Improve effective memory latency |
| Dual-core implementations* | Dual-core + multi-core implementations* | Higher performance density<br/>Better system price/performance |
| Performance driven by high clock rates (>3GHz) | Improved clock-cycle utilization | Sustained performance advantage for business critical applications |
| Mature development tools and compiler optimization | Core hardware performance improved by future compiler optimizations | Installed systems get faster, even without hardware upgrades |

## Itanium® Architecture: Optimized for Multi-Core

- Parallel execution leadership: only Intel has all 3:
  - Multi cores on same die
  - Multi threads on same core
  - Explicit Parallelism in each core
- EPIC\*: inherent advantages for multicore, multi-thread
  - Architecture: Parallelism + many registers to keep data on-chip
  - Core size: Smaller than IA-32, up to 2X more cores per die on Tukwila (than on IA-32)

\* For Enterprise & Technical Computing Application Segments

Itanium® Processor family delivers >2X Moore's
Law performance

<sup>\*</sup> EPIC is Itanium's architecture: "Explicitly Parallel Instruction Computing"

# The Itanium® Architecture

## Explicitly Parallel Instruction Computing Basic Ideas

### Static Hardware Design

- Compiler creates record of execution
  - Instructions in bundles
- Machine plays record
  - Distribute among execution units
- No runtime changes like out-of-order execution

### High Scalability of 'execution units'

- Very Large Instruction Word (VLIW) concept
- Focus is parallelism
  - 6 instructions in parallel (2 bundles per cycle)
- High number of execution units

### Itanium Architecture – Basic Ideas

Increased parallelization - more throughput

## Traditional Architecture Limits – EPIC Solutions

- Today's limit: the complexity of multiple pipelines is too great to allow effective on-chip scheduling for parallel operation
  → Solution: explicit parallelism. The compiler handles scheduling and communicates it to the chip
- Today's limit: the number of registers on chip limits parallelism
  → Solution: quadruple registers from 32 to 128, increasing register addressing from 5 bits to 7
- Today's limit: large (and growing) memory latency
  → Solution: speculative loads
- Today's limit: conditional and/or unpredictable branches
  → Solution: prediction and predication orchestrated by the compiler

# Increasing Instruction Level Parallelism

## Explicit Parallelism

- Instruction Level Parallelism (**ILP**) is the ability to execute
multiple instructions at the same time
- Explicitly Parallel Instruction Computing (**EPIC**) allows the compiler or assembler to specify the parallelism
  - Compiler specifies **Instruction Groups**, a list of instructions with no dependencies that can be executed in parallel
  - Instructions are packed in **bundles** of 3 instructions each
- Instruction bundle
  - Two executed per cycle
- Massive resources on chip
  - Large number of registers to avoid register contention

## Instruction Format: Bundles & Templates

- Bundle (128 bits)
  - Set of three instructions (41 bits each)
  - Template (5 bits)
    - Identifies types of instructions in bundle
      - One of Integer, Memory, Branch, Floating, eXtended
      - Identifies independent operations ("stops") -> MM\_F
    - Defines execution units to be invoked executing the bundle
- Compiler can schedule functional units to avoid contention

## Instruction Format: Bundles & Templates

- Instruction types
  - M: Memory
  - I: Shifts and multimedia
  - A: Integer Arithmetic and Logical Unit
  - B: Branch
  - F: Floating point
  - L+X: Long (move, branch, ...)
- Template encodes types
  - MII, MLX, MMI, MFI, MMF
  - Branch: MIB, MMB, MFB, MBB, BBB
- Template encodes parallelism
  - All come in two flavors: with and without stop at end
  - Also, stop in middle: MI\_I M\_MI

### Explicitly Parallel Instruction Encoding

### Instruction Dispersal, Itanium® Implementation

Flexible Issue Capability

Up to 6 instructions executed per clock

## Explicitly Parallel Instruction Computing EPIC

### Defined templates

Execution units: M = Memory, I = Integer, F = FP, B = Branch. A semicolon in the template name marks a stop.

| No. | Template | Slot 0 | Slot 1 | Slot 2 |
|-----|----------|--------|--------|--------|
| 0 | MII | M | I | I |
| 1 | MII; | M | I | I |
| 2 | MI;I | M | I | I |
| 3 | MI;I; | M | I | I |
| 4 | MLX | M | I | I |
| 5 | MLX; | M | I | I |
| 8 | MMI | M | M | I |
| 9 | MMI; | M | M | I |
| 10 | M;MI | M | M | I |
| 11 | M;MI; | M | M | I |
| 12 | MFI | M | F | I |
| 13 | MFI; | M | F | I |
| 14 | MMF | M | M | F |
| 15 | MMF; | M | M | F |
| 16 | MIB | M | I | B |
| 17 | MIB; | M | I | B |
| 18 | MBB | M | B | B |
| 19 | MBB; | M | B | B |
| 22 | BBB | B | B | B |
| 23 | BBB; | B | B | B |
| 24 | MMB | M | M | B |
| 25 | MMB; | M | M | B |
| 28 | MFB | M | F | B |
| 29 | MFB; | M | F | B |

## Itanium® 2 Dispersal Matrix

| | MII | MLI | MMI | MFI | MMF | MIB | MBB | BBB | MMB | MFB |
|------|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| MII | | | | | | | | | | |
| MLI | | | | | | | | | | |
| MMI | | | | | | | | | | |
| MFI | | | | | | | | | | |
| MMF | | | | | | | | | | |
| MIB* | | | | | | | | | | |
| MBB | | | | | | | | | | |
| BBB | | | | | | | | | | |
| MMB* | | | | | | | | | | |
| MFB* | | | | | | | | | | |

\* hint in first bundle

Possible Itanium® 2 full issue

Possible Itanium® processor and Itanium® 2 full
issue

Itanium® 2 allows more compiler dispersal options

### Instruction Groups

- Instruction groups:
  - Set of instructions
  - No dependencies (RAW, WAW) within group
  - May execute in parallel
- The processor executes as many instructions per instruction group as possible, based on its resources
- Must contain at least one instruction (no upper limit)
- Instruction groups are indicated by cycle breaks (;;)

## Instruction groups and bundles

```
ld8 r5 = [r7]
sub r1 = r2, r3
add r10 = r20, r21 ;;
add r1 = r1, r5 ;;
st8 [r7] = r1
```

Instructions within a group may not have any register dependencies on each other. `;;` indicates the end of a group.

#### Instruction bundles

Instructions are fetched and executed in bundles.

#### Instruction groups and bundles

Itanium® and Itanium2® fetch 2 bundles at a time for execution. They may or may not execute in parallel.

There are two difficulties:

1. Finding instruction triplets matching the defined templates.
2. Matching pairs of bundles that can execute in parallel.

#### Massive On Chip Resources

Several register files visible to the programmer:

# Improving Branch Handling

#### What is the problem?

#### Traditional CPUs:

- Branch prediction is used to predict the most likely set of instructions
- Correct branch prediction keeps the execution pipelines full
- A mispredicted branch flushes the pipeline with a large penalty

#### Itanium® architecture improves branch handling:

- Provide a way to minimize branches using predicates
- Provide support for special branch instructions
  - Counted loop: br.ctop, br.cexit
  - While loop: br.wtop, br.wexit
  - ....
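Returning to the instruction-group rule from the Instruction Groups slide (no RAW or WAW register dependencies inside a group), the check a compiler or assembler has to make can be sketched in a few lines of Python. This is a simplified model, not the architecture's full rule set: it looks only at register names, ignores memory dependencies and the special cases the real ISA permits, and the tuple layout and the name `find_dependencies` are assumptions made for this illustration.

```python
def find_dependencies(group):
    """Return RAW/WAW register dependencies inside one instruction group.

    `group` is a list of (text, writes, reads) tuples; `writes` and `reads`
    are tuples of register names. Hypothetical model, registers only.
    """
    last_writer = {}  # register name -> index of the latest instruction writing it
    found = []
    for index, (_text, writes, reads) in enumerate(group):
        for reg in reads:
            if reg in last_writer:  # read after write within the group
                found.append(("RAW", reg, last_writer[reg], index))
        for reg in writes:
            if reg in last_writer:  # write after write within the group
                found.append(("WAW", reg, last_writer[reg], index))
        for reg in writes:
            last_writer[reg] = index
    return found


# The five instructions from the slide, annotated by hand.
LD8 = ("ld8 r5 = [r7]", ("r5",), ("r7",))
SUB = ("sub r1 = r2, r3", ("r1",), ("r2", "r3"))
ADD_A = ("add r10 = r20, r21", ("r10",), ("r20", "r21"))
ADD_B = ("add r1 = r1, r5", ("r1",), ("r1", "r5"))
ST8 = ("st8 [r7] = r1", (), ("r7", "r1"))

# Grouping as written on the slide: stops after the third and fourth instruction.
as_written = [find_dependencies(g) for g in [[LD8, SUB, ADD_A], [ADD_B], [ST8]]]

# Dropping the first stop would put the second add in the same group as its producers.
merged = find_dependencies([LD8, SUB, ADD_A, ADD_B])

if __name__ == "__main__":
    print(as_written)
    print(merged)
```

With the stops placed as on the slide, every group comes back clean. Merging the first two groups reports that the second `add` both reads and overwrites `r1` and reads `r5`, which is why the first `;;` is needed.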
## Branch Handling - Predication

- Conditional execution of instructions
  - When the predicate is true, the instruction is executed
  - When it is false, the instruction is treated as a NOP
- Predication converts a control dependency into a data dependency
- Predication eliminates branches in the code

Traditional code:

```
if (a>b) c = c + 1
else d = d * e + f
```

Avoid the branch by using predicated code:

```
p1, p2 = compare(a>b)
if (p1) c = c + 1
if (p2) d = d * e + f
```

- Predicate p1 is set to 1 if the compare evaluates to true, and to 0 if it evaluates to false
- p2 is the complement of p1

#### Before:

Instructions c = c + 1 and d = d \* e + f are control dependent on a > b

#### After:

- Instructions are data dependent:
  - Values of p1 and p2
  - They determine execution
- The branch is eliminated

# Traditional Architecture

#### **Itanium® Architecture**

Only one 'branch' will have a valid predicate and be executed.

- Predication provides the ability to conditionally execute instructions based on computed true/false conditions
  - avoids branches
  - a predicated instruction either completes or is dismissed (treated as a no-op)
  - predicate registers are set by compare/test instructions

#### **Optimized IPF**

| (p1,p2)<-cmp(r1,r2) |
|---------------------|
| if (p1) instr 2 |
| if (p2) instr 4 |
| if (p1) instr 3 |
| if (p2) instr 5 |

#### IPF Instructions (cont)

- Instruction style is "(Pn) opcode target(s)=source(s)"
- Example:
  - First instruction only:
    - P4 controls whether the results are kept or discarded
    - the result registers are predicate registers P7 and P12
    - R37 is compared for equality with R52
    - If equal: P7 is set to 1 and P12 is set to 0.
    - If not equal: P7 is set to 0 and P12 is set to 1.
  - The combination of three instructions shows how an if-then-else might be coded.
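The if-conversion described above can be modelled in Python to show that the predicated form computes the same result as the branching form. This is only a behavioural sketch: the function names, the dictionary-based register and predicate files, and the `compare_gt` helper are assumptions made for illustration, not Itanium® mnemonics. The compare follows the behaviour described for the qualified compare above (if the qualifying predicate is false, the target predicates are left unchanged), and `p0` is treated as always true.

```python
def run_branchy(a, b, c, d, e, f):
    """Traditional code: one of the two assignments is selected by a branch."""
    if a > b:
        c = c + 1
    else:
        d = d * e + f
    return c, d


def compare_gt(preds, qp, pt, pf, x, y):
    """Model of a predicated compare: writes complementary predicates pt and pf.

    If the qualifying predicate qp is false, pt and pf keep their old values.
    """
    if preds[qp]:
        preds[pt] = x > y
        preds[pf] = not (x > y)


def run_predicated(a, b, c, d, e, f):
    """Predicated code: both instructions are issued, predicates decide the outcome."""
    preds = {"p0": True, "p1": False, "p2": False}  # p0 is always true
    regs = {"c": c, "d": d}
    compare_gt(preds, "p0", "p1", "p2", a, b)
    program = [
        ("p1", "c", lambda r: r["c"] + 1),      # (p1) c = c + 1
        ("p2", "d", lambda r: r["d"] * e + f),  # (p2) d = d * e + f
    ]
    for qp, dest, operation in program:
        result = operation(regs)  # the instruction flows through the pipeline either way
        if preds[qp]:             # the result is kept only when the predicate is true
            regs[dest] = result
    return regs["c"], regs["d"]


if __name__ == "__main__":
    print(run_branchy(5, 3, 10, 2, 4, 1), run_predicated(5, 3, 10, 2, 4, 1))
    print(run_branchy(1, 3, 10, 2, 4, 1), run_predicated(1, 3, 10, 2, 4, 1))
```

The loop in `run_predicated` contains no decision about which instruction to run next; the only conditional is whether a result is written back, which is the control-to-data dependency conversion the slide describes.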
# Reducing Memory Access Cost

## Reducing Memory Access Cost

- Itanium® architecture eliminates many memory accesses through:
  - large register files to manage work in progress
  - better control of the memory hierarchy (cache hints)
- Itanium® architecture reduces the cost of the remaining memory accesses by:
  - moving load instructions earlier in the code
    - Data speculation: advance a load before a possible data dependency
    - Control speculation: speculative load before its guarding branch
  - -> allows early execution of loads to hide latency
  - -> enables the processor to bring in the data in time
  - -> avoids stalling the processor

## Data Speculation

- allows early execution of loads to hide latency
- advance load before a possible data dependency (load before store)

Figure (typical vs. optimized IPF): the load is rescheduled above the store as load.a; chk.a sits at the original load position and branches to recovery code when needed.

- support for data speculation
  - ALAT (advanced load address table): hardware structure that contains information about outstanding advanced loads
  - advanced loads: ld.a
  - check loads: ld.c
  - advance load checks: chk.a
  - speculative advanced loads: ld.sa

Memory latency can be responsible for 60% or more of processor stalls

April 21, 2004

## **Control Speculation**

- allows early execution of loads to hide latency
- speculative load before a branch that guards it

#### support for control speculation

- NaT (Not a Thing) bit: 65<sup>th</sup> bit of a GR, set on incorrect speculation instead of faulting
- NaT bit propagated in computations
- speculation check: chk.s
- speculative load: ld.s

speculation hides memory latency

#### Massive Memory Resources

- Physical memory
  - Full implementation will address 16 EB of physical memory (2<sup>64</sup>)
    - 16,000,000,000GB
  - Itanium® architecture microprocessor has 44-bit address bus
    - 16TB (16,000GB) physical memory addressable
  - Itanium2® architecture microprocessors have 50-bit address bus
- Virtual memory
  - Itanium® architecture microprocessor uses 50 bits
  - Itanium2® architecture microprocessors use 64 bits

# Supporting Modular Code

#### Procedure Call Overhead

- Modular programs create more overhead
  - Programs tend to be call intensive
  - Register space shared by caller and callee
  - Calls/returns require register saves/restores
  - Frequent memory access
  - Limitations due to resource shortage
- Itanium® solution
  - Massive register resources
    - Renaming, rotating
    - Integer registers stackable
  - Register Stack Engine (RSE)
    - Eliminates memory accesses
    - Allows local registers to be allocated dynamically

#### Register Stack

- The general register stack is divided into two subsets:
  - Static: 32 permanent registers (r0-r31)
    - visible to all procedures
    - used for global variables
  - Stacked: 96 other registers are like a stack
    - procedure code allocates up to 96 registers for a frame
    - previous frame is hidden
    - first register is renamed to logical register r32
    - small frames eliminate/reduce saving/restoring registers to/from memory

Procedure A, Procedure B

## Register Stack Engine (RSE)

When a procedure is called:

- A new frame of stacked registers is made available
- Global registers remain available
- The caller's local registers become invisible to the called procedure

If deep nesting exhausts the registers, the RSE will save the contents of hidden registers to local memory to free up resources.

On return to the caller, the RSE restores the caller's register content automatically.

The RSE works in the background on the stacked registers, utilizing unused memory bandwidth.

Activity not visible to application programs

#### Procedure Call Overhead

#### **Traditional**

- Procedure A
  - call B
- Procedure B
  - save current register state
  - ...
  - restore previous register state
  - return

#### Itanium® Architecture

- Procedure A
  - call B
- Procedure B
  - alloc, no save!
  - ...
  - no restore!
(remap)
  - return

## Register Stack Engine (RSE)

# Loop Optimization Overhead

- Enhance loop performance:
  - Done by unrolling loops
  - Causes code expansion
  - Prologue/epilogue add to code size
- Itanium® solution
  - Software pipelining
  - Architecture support
    - Minimal prologue/epilogue code
    - Predication
    - Loop control registers (LC, EC)
    - Loop branches (br.ctop, br.wtop)

# Software Pipelining

- Multiple iterations execute in parallel
  - ILP maximized
- Different iteration stages execute in parallel
  - Execution load is balanced

# Software Pipelining

| iteration 1 | iteration 2 | iteration 3 | iteration 4 | iteration 5 | cycle |
|-------------|-------------|-------------|-------------|-------------|-------|
| ld4 | | | | | X |
| | ld4 | | | | X+1 |
| add | | ld4 | | | X+2 |
| st4 | add | | ld4 | | X+3 |
| | st4 | add | | ld4 | X+4 |
| | | st4 | add | | X+5 |
| | | | st4 | add | X+6 |
| | | | | st4 | X+7 |

Prolog, Epilog

#### Architecture Limits – EPIC Solutions

- Today's limit: the complexity of multiple pipelines is too great to allow effective on-chip scheduling for parallel operation
  → Solution: explicit parallelism. The compiler handles scheduling and communicates it to the chip
- Today's limit: the number of registers on chip limits parallelism
  → Solution: quadruple registers from 32 to 128, increasing register addressing from 5 bits to 7
- Today's limit: large (and growing) memory latency
  → Solution: speculative loads
- Today's limit: conditional and/or unpredictable branches
  → Solution: prediction and predication orchestrated by the compiler

## Itanium® Architecture Training

The following classes can be taken online or can be downloaded:

- <u>Getting Software Ready for Intel® Itanium® Architecture</u>
- Introducing the Intel® Itanium® Architecture
- <u>Using the Intel® Itanium® Processor Instruction Set</u>
- Intel® Software College
https://shale.intel.com/softwarecollege/CourseCatalog.asp?CatName=PROCESSORS