Trace-Level Speculative Multithreaded Architecture
Carlos Molina
Universitat Rovira i Virgili – Tarragona, Spain
Antonio González and Jordi Tubella
Universitat Politècnica de Catalunya – Barcelona, Spain
{antonio,jordit}@ac.upc.es
ICCD´02, Freiburg (Germany) - September 16-18, 2002
Outline
Motivation
Related Work
TSMA
Performance Results
Conclusions
Motivation
Two techniques to avoid the serialization caused by data dependences: Data Value Speculation and Data Value Reuse
Speculation predicts values based on past history
Reuse is possible if the computation has already been done in the past
Both may be considered at two levels: Instruction Level and Trace Level
Trace Level Reuse
A set of instructions can be skipped in a row
These instructions do not need to be fetched
The live-input test is not easy to handle
(Diagram: Trace Level Reuse, dynamic vs. static.)
Trace Level Speculation
Solves the live-input test
Introduces penalties due to misspeculations
Two orthogonal issues: microarchitecture support for trace speculation, and control and data speculation techniques
– prediction of initial and final points
– prediction of live-output values
Trace Level Speculation: with Live Input Test vs. with Live Output Test
Trace Level Speculation with Live Input Test
(Diagram: NST and ST timelines. Live-output actualization & trace speculation; the ST performs live-input validation & instruction execution through a buffer while speculated instructions are not executed by the NST; miss trace speculation detection & recovery actions.)
Trace Level Speculation with Live Output Test
(Diagram: NST and ST timelines. Live-output actualization & trace speculation; speculated instructions are not executed and are checked by live-output validation; miss trace speculation detection & recovery actions.)
Related Work
Trace Level Reuse
– Basic blocks (Huang and Lilja, 99)
– General traces (González et al., 99)
– Traces with compiler support (Connors and Hwu, 99)
Trace Level Speculation
– DIVA (Austin, 99)
– Slipstream processors (Rotenberg et al., 99)
– Pre-execution (Sohi et al., 01)
– Precomputation (Shen et al., 01)
– Nearby and distant ILP (Balasubramonian et al., 01)
TSMA
(Block diagram: I-Cache, Fetch Engine, Decode & Rename, Functional Units, Branch Predictor, and Trace Speculation Engine; duplicated NST/ST Reorder Buffers, Ld/St Queues, Instruction Windows, and architectural register files; the Look Ahead Buffer and the Verification Engine; the L1SDC speculative data cache and the L1NSDC/L2NSDC non-speculative data caches.)
Trace Speculation Engine
Two issues must be handled: implementing a trace-level predictor, and communicating trace speculation opportunities
Trace-level predictor: PC-indexed table with N entries. Each entry contains:
– live-output values
– final program counter of the trace
Trace speculation communication: INI_TRACE instruction and additional MOVE instructions
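The PC-indexed predictor table above can be sketched as follows. This is a minimal software model, not the hardware: the entry layout (final PC, live-output values) follows the slide, while the table size, the eviction policy, and the value representation are illustrative assumptions.

```python
class TracePredictorEntry:
    """One predictor entry: the trace's final PC and its live-output values."""
    def __init__(self, final_pc, live_outputs):
        self.final_pc = final_pc          # final program counter of the trace
        self.live_outputs = live_outputs  # {register id: predicted value}

class TracePredictor:
    """PC-indexed table with N entries, looked up at fetch time."""
    def __init__(self, n_entries=64):
        self.n_entries = n_entries
        self.table = {}                   # indexed by the trace's initial PC

    def update(self, initial_pc, final_pc, live_outputs):
        # Record a trace observed at run time (arbitrary eviction when full).
        if initial_pc not in self.table and len(self.table) >= self.n_entries:
            self.table.pop(next(iter(self.table)))
        self.table[initial_pc] = TracePredictorEntry(final_pc, live_outputs)

    def lookup(self, pc):
        # A hit signals a trace speculation opportunity at this PC.
        return self.table.get(pc)

pred = TracePredictor()
pred.update(0x400100, 0x400180, {1: 42, 3: 7})
hit = pred.lookup(0x400100)
```

On a hit, the fetch engine would jump to `final_pc` and install the predicted live-output values in the speculative thread's state.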
Look Ahead Buffer
First-in, first-out queue
Stores instructions executed by the ST
The fields of each entry are:
– Program counter
– Operation type (indicates a memory operation)
– Source register id 1 & source value 1
– Source register id 2 & source value 2
– Destination register id & destination value
– Memory address
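The entry fields above can be modeled as a small FIFO. The named-tuple layout mirrors the listed fields; the capacity check and overflow behavior are illustrative assumptions.

```python
from collections import deque, namedtuple

# One Look Ahead Buffer entry, with the fields listed on the slide.
LABEntry = namedtuple("LABEntry", [
    "pc",                      # program counter
    "op_type",                 # operation type; marks memory operations
    "src1_id", "src1_val",     # source register id 1 & source value 1
    "src2_id", "src2_val",     # source register id 2 & source value 2
    "dst_id", "dst_val",       # destination register id & destination value
    "mem_addr",                # effective address for memory ops, else None
])

class LookAheadBuffer:
    """FIFO queue holding instructions executed by the speculative thread."""
    def __init__(self, capacity=128):
        self.entries = deque()
        self.capacity = capacity

    def push(self, entry):
        # The ST inserts executed instructions at the tail.
        if len(self.entries) >= self.capacity:
            raise OverflowError("LAB full")
        self.entries.append(entry)

    def pop(self):
        # The Verification Engine consumes entries from the head, in order.
        return self.entries.popleft()

lab = LookAheadBuffer()
lab.push(LABEntry(0x400100, "alu", 1, 10, 2, 20, 3, 30, None))
head = lab.pop()
```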
Verification Engine
Validates speculated instructions
Maintains the non-speculative state
Consumes instructions from the LAB
The test is performed as follows:
– source values of each instruction are checked against the non-speculative state
– if they match, the destination value of the instruction may be updated
– memory operations also check the effective address
– store instructions update memory; the rest update registers
The hardware required is minimal
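The validation test can be sketched as below. The entry layout and the dict-based register and memory state are illustrative assumptions; the effective-address re-check for memory operations is elided for brevity.

```python
from collections import namedtuple

# Simplified LAB entry used by this sketch (not the real hardware layout).
LABEntry = namedtuple("LABEntry",
                      "pc op_type src1_id src1_val src2_id src2_val "
                      "dst_id dst_val mem_addr")

def verify(entry, regs, mem):
    """Validate one speculated instruction against the non-speculative state.
    On a match, commit its destination value; on a mismatch, signal recovery."""
    # Source values must match the current non-speculative register state.
    for rid, val in ((entry.src1_id, entry.src1_val),
                     (entry.src2_id, entry.src2_val)):
        if rid is not None and regs.get(rid) != val:
            return False                        # mismatch: trace misspeculation
    if entry.op_type == "store":
        mem[entry.mem_addr] = entry.dst_val     # stores update memory
    else:
        regs[entry.dst_id] = entry.dst_val      # the rest update registers
    return True

regs = {1: 10, 2: 20}
mem = {}
ok = verify(LABEntry(0x400100, "alu", 1, 10, 2, 20, 3, 30, None), regs, mem)
bad = verify(LABEntry(0x400104, "alu", 3, 99, None, None, 4, 0, None), regs, mem)
```

A `False` result corresponds to miss trace speculation detection, triggering the recovery actions of the thread-synchronization mechanism.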
Thread Synchronization
Handles trace mispredictions
Recovery actions involved:
– instruction execution is stopped
– ST structures are emptied (IW, LSQ, ROB, LAB)
– the speculative cache and the ST register file are invalidated
Two types of synchronization:
– Total (occurs when the NST is not executing instructions): penalty to refill the pipeline
– Partial (occurs when the NST is executing instructions): no penalty; the NST takes the role of the ST
Maintains both speculative and non-speculative memory state
The traditional memory subsystem is supported
A small additional first-level cache is added to maintain the speculative memory state
Rules:
1. An ST store updates values in the L1SDC only
2. An ST load gets values from the L1SDC; on a miss, from the NS caches
3. An NST store updates values and allocates space in the NS caches
4. An NST load gets values and allocates space in the NS caches
5. A line replaced in the L1NSDC is copied back to the L2NSDC
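The rules above can be sketched with dict-based caches standing in for the L1SDC and the L1NSDC/L2NSDC. Cache sizes and line replacement are not modeled, so rule 5 (copy-back on replacement) is omitted; this is a behavioral sketch under those assumptions, not the hardware.

```python
class SpeculativeMemory:
    """Behavioral model of the TSMA memory rules (no sizes, no replacement)."""
    def __init__(self):
        self.l1sdc = {}    # speculative data cache (ST state)
        self.l1nsdc = {}   # non-speculative L1 data cache
        self.l2nsdc = {}   # non-speculative L2 data cache

    def st_store(self, addr, val):
        # Rule 1: an ST store updates the L1SDC only.
        self.l1sdc[addr] = val

    def st_load(self, addr):
        # Rule 2: an ST load checks the L1SDC first, then the NS caches.
        if addr in self.l1sdc:
            return self.l1sdc[addr]
        return self.nst_load(addr)

    def nst_store(self, addr, val):
        # Rule 3: an NST store updates and allocates in the NS caches.
        self.l1nsdc[addr] = val
        self.l2nsdc[addr] = val

    def nst_load(self, addr):
        # Rule 4: an NST load gets values and allocates in the NS caches.
        val = self.l1nsdc.get(addr, self.l2nsdc.get(addr, 0))
        self.l1nsdc[addr] = val
        return val

    def invalidate_speculative(self):
        # On a trace misspeculation, the speculative state is discarded.
        self.l1sdc.clear()

m = SpeculativeMemory()
m.nst_store(0x10, 5)       # non-speculative value
m.st_store(0x10, 9)        # speculative value shadows it for the ST only
```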
Memory Subsystem
(Diagram: L1SDC speculative data cache alongside the L1NSDC/L2NSDC non-speculative data caches.)
Register File
Slightly modified to permit prompt execution
The register map table contains, for each entry:
– Committed value
– ROB tag
– Counter
The counter field is maintained as follows:
– a new ST instruction increments its destination register's counter
– the counter is decremented when the ST instruction commits
– after a trace speculation, counters are no longer incremented, but each is decremented until it reaches zero
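The counter maintenance can be sketched as follows; the map-table layout is an illustrative assumption that mirrors the three fields listed above.

```python
class MapEntry:
    """One register map table entry: committed value, ROB tag, counter."""
    def __init__(self):
        self.committed_value = 0
        self.rob_tag = None
        self.counter = 0   # number of in-flight ST writers of this register

# One map entry per architected register (32 integer registers assumed here).
map_table = {r: MapEntry() for r in range(32)}

def st_rename_dest(reg):
    # A new ST instruction increments its destination register's counter.
    map_table[reg].counter += 1

def st_commit_dest(reg):
    # The counter is decremented when the ST instruction commits. After a
    # trace speculation, counters are only decremented, down to zero.
    if map_table[reg].counter > 0:
        map_table[reg].counter -= 1

st_rename_dest(3)
st_rename_dest(3)
st_commit_dest(3)
```

A zero counter tells the hardware that no in-flight ST instruction still targets the register, so the committed value is safe to use.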
Working Example
1. ST begins execution
2. Live-output actualization & trace speculation
3. NST begins execution
4. VE validates instructions
5. NST executes the speculated trace
6. NST executes some additional instructions
7. VE begins verification
8. VE finishes verification
9. Live-output actualization & trace speculation
10. NST execution
(Timeline diagram: NST, ST, and VE activity, showing instruction execution, not-executed intervals, and live-output validation.)
Experimental Framework
Simulator: Alpha version of the SimpleScalar toolset
Benchmarks: SPEC95
Maximum optimization level: DEC C & F77 compilers with -non_shared -O5
Statistics: collected for 125 million instructions, skipping initializations
Instruction fetch: 4 instructions per cycle
Branch predictor: 2048-entry bimodal predictor
Instruction issue/commit: out-of-order issue, 4 instructions commit per cycle, 64-entry reorder buffer; loads execute once preceding stores are known; store-load forwarding
Architected registers: 32 integer and 32 FP
Functional units: 4 integer ALUs, 2 load/store units, 4 FP adders, 1 integer mult/div, 1 FP mult/div
FU latency/repeat time: integer ALU 1/1, load/store 1/1, integer mult 3/1, integer div 20/19, FP adder 2/1, FP mult 4/1, FP div 12/12
Data cache: 16 KB, 2-way set associative, 32-byte block, 6-cycle miss latency
Instruction cache: 16 KB, direct mapped, 32-byte line, 6-cycle miss latency
Second-level cache: shared instruction & data cache, 256 KB, 4-way set associative, 32-byte block, 100-cycle miss latency
Base Microarchitecture
Speculation data cache: 1 KB, direct-mapped, 8-byte block
Verification engine: up to 8 instructions verified per cycle; memory instructions block verification on an L1 miss; the average number of instructions verified to find an error is 8
Trace speculation engine: history table with 64 entries, 2-way set associative, 9 instances per entry
Look Ahead Buffer: 128 entries
TSMA Additional Structures
Performance Evaluation
Main objective: show that trace misspeculations cause minor penalties
Traces are built following a simple rule: from backward branch to backward branch, with minimum and maximum sizes of 8 and 64 instructions respectively
A simple trace predictor is evaluated: stride + context value (history of 9)
Results provided:
– percentage of misspeculations
– percentage of predicted instructions
– speedup
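For context, the stride half of a stride + context value predictor can be sketched as below. The context component and the 9-deep history evaluated in the paper are omitted for brevity; the table keying and structure are illustrative assumptions.

```python
class StridePredictor:
    """Last-value + stride predictor: predicts last_value + last_stride."""
    def __init__(self):
        self.last = {}     # last observed value, keyed by (pc, register)
        self.stride = {}   # last observed stride, keyed the same way

    def predict(self, key):
        # No prediction until at least one value has been seen.
        if key not in self.last:
            return None
        return self.last[key] + self.stride.get(key, 0)

    def update(self, key, value):
        # Learn the stride from consecutive observed values.
        if key in self.last:
            self.stride[key] = value - self.last[key]
        self.last[key] = value

p = StridePredictor()
for v in (10, 20, 30):
    p.update("r3", v)
```

After observing 10, 20, 30 the learned stride is 10, so the next prediction for `r3` is 40.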
Misspeculations
(Figure: percentage of trace misspeculations for Applu, Compress, Gcc, Go, Ijpeg, Li, M88ksim, Mgrid, Perl, Turb3d, Vortex, and A_Mean; y-axis 0 to 100%.)
Predicted Instructions
(Figure: percentage of predicted instructions for Applu, Compress, Gcc, Go, Ijpeg, Li, M88ksim, Mgrid, Perl, Turb3d, Vortex, and A_Mean; y-axis 0 to 50%.)
Speedup
(Figure: speedup for Applu, Compress, Gcc, Go, Ijpeg, Li, M88ksim, Mgrid, Perl, Turb3d, Vortex, and A_Mean; y-axis 1.00 to 1.35.)
Conclusions
TSMA is designed to exploit trace-level speculation
Special emphasis on minimizing misspeculation penalties
Results show:
– the architecture is tolerant to misspeculations
– a speedup of 16% with a predictor that misses 70% of the time
Future Work
Aggressive trace-level predictors: bigger traces, better value predictors
Generalization to multiple threads: cascaded execution
Mixing prediction & execution: speculated traces do not need to be fully speculated