Trace-Level Speculative Multithreaded Architecture
Carlos Molina
Universitat Rovira i Virgili – Tarragona, Spain
Antonio González and Jordi Tubella
Universitat Politècnica de Catalunya – Barcelona, Spain
{antonio,jordit}@ac.upc.es
ICCD´02, Freiburg (Germany) - September 16-18, 2002
Outline
Motivation
Related Work
TSMA
Performance Results
Conclusions
Motivation
Two techniques to avoid the serialization caused by data dependences: Data Value Speculation and Data Value Reuse
Speculation predicts values based on past history
Reuse is possible if the computation has already been done in the past
Both may be considered at two levels: Instruction Level and Trace Level
Trace Level Reuse
A set of instructions can be skipped in a row
These instructions do not need to be fetched
The live-input test is not easy to handle
(Diagram: Trace Level Reuse, dynamic vs. static.)
Trace Level Speculation
Solves the live-input test
Introduces penalties due to misspeculations
Two orthogonal issues: microarchitecture support for trace speculation, and control and data speculation techniques
– prediction of initial and final points
– prediction of live-output values
Trace Level Speculation: with Live Input Test vs. with Live Output Test
Trace Level Speculation with Live Input Test
(Diagram: NST and ST timelines. Live-output actualization & trace speculation; the ST performs live-input validation & instruction execution through a buffer while speculated instructions are not executed by the NST; miss trace speculation detection & recovery actions.)
Trace Level Speculation with Live Output Test
(Diagram: NST and ST timelines. Live-output actualization & trace speculation; speculated instructions are not executed and are checked by live-output validation; miss trace speculation detection & recovery actions.)
Related Work
Trace Level Reuse
– Basic blocks (Huang and Lilja, 99)
– General traces (González et al., 99)
– Traces with compiler support (Connors and Hwu, 99)
Trace Level Speculation
– DIVA (Austin, 99)
– Slipstream processors (Rotenberg et al., 99)
– Pre-execution (Sohi et al., 01)
– Precomputation (Shen et al., 01)
– Nearby and distant ILP (Balasubramonian et al., 01)
TSMA
(Block diagram: I-Cache, Fetch Engine, Decode & Rename, Functional Units, Branch Predictor, and Trace Speculation Engine; duplicated NST/ST Reorder Buffers, Ld/St Queues, Instruction Windows, and architectural register files; the Look Ahead Buffer and the Verification Engine; the L1SDC speculative data cache and the L1NSDC/L2NSDC non-speculative data caches.)
Trace Speculation Engine
Two issues must be handled: implementing a trace-level predictor, and communicating trace speculation opportunities
Trace-level predictor: PC-indexed table with N entries. Each entry contains:
– live-output values
– final program counter of the trace
Trace speculation communication: INI_TRACE instruction and additional MOVE instructions
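The PC-indexed predictor table above can be sketched as follows. This is a minimal software model, not the hardware: the entry layout (final PC, live-output values) follows the slide, while the table size, the eviction policy, and the value representation are illustrative assumptions.

```python
class TracePredictorEntry:
    """One predictor entry: the trace's final PC and its live-output values."""
    def __init__(self, final_pc, live_outputs):
        self.final_pc = final_pc          # final program counter of the trace
        self.live_outputs = live_outputs  # {register id: predicted value}

class TracePredictor:
    """PC-indexed table with N entries, looked up at fetch time."""
    def __init__(self, n_entries=64):
        self.n_entries = n_entries
        self.table = {}                   # indexed by the trace's initial PC

    def update(self, initial_pc, final_pc, live_outputs):
        # Record a trace observed at run time (arbitrary eviction when full).
        if initial_pc not in self.table and len(self.table) >= self.n_entries:
            self.table.pop(next(iter(self.table)))
        self.table[initial_pc] = TracePredictorEntry(final_pc, live_outputs)

    def lookup(self, pc):
        # A hit signals a trace speculation opportunity at this PC.
        return self.table.get(pc)

pred = TracePredictor()
pred.update(0x400100, 0x400180, {1: 42, 3: 7})
hit = pred.lookup(0x400100)
```

On a hit, the fetch engine would jump to `final_pc` and install the predicted live-output values in the speculative thread's state.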
Look Ahead Buffer
First-in, first-out queue
Stores instructions executed by the ST
The fields of each entry are:
– Program counter
– Operation type (indicates a memory operation)
– Source register id 1 & source value 1
– Source register id 2 & source value 2
– Destination register id & destination value
– Memory address
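The entry fields above can be modeled as a small FIFO. The named-tuple layout mirrors the listed fields; the capacity check and overflow behavior are illustrative assumptions.

```python
from collections import deque, namedtuple

# One Look Ahead Buffer entry, with the fields listed on the slide.
LABEntry = namedtuple("LABEntry", [
    "pc",                      # program counter
    "op_type",                 # operation type; marks memory operations
    "src1_id", "src1_val",     # source register id 1 & source value 1
    "src2_id", "src2_val",     # source register id 2 & source value 2
    "dst_id", "dst_val",       # destination register id & destination value
    "mem_addr",                # effective address for memory ops, else None
])

class LookAheadBuffer:
    """FIFO queue holding instructions executed by the speculative thread."""
    def __init__(self, capacity=128):
        self.entries = deque()
        self.capacity = capacity

    def push(self, entry):
        # The ST inserts executed instructions at the tail.
        if len(self.entries) >= self.capacity:
            raise OverflowError("LAB full")
        self.entries.append(entry)

    def pop(self):
        # The Verification Engine consumes entries from the head, in order.
        return self.entries.popleft()

lab = LookAheadBuffer()
lab.push(LABEntry(0x400100, "alu", 1, 10, 2, 20, 3, 30, None))
head = lab.pop()
```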
Verification Engine
Validates speculated instructions
Maintains the non-speculative state
Consumes instructions from the LAB
The test is performed as follows:
– source values of each instruction are checked against the non-speculative state
– if they match, the destination value of the instruction may be updated
– memory operations also check the effective address
– store instructions update memory; the rest update registers
The hardware required is minimal
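The validation test can be sketched as below. The entry layout and the dict-based register and memory state are illustrative assumptions; the effective-address re-check for memory operations is elided for brevity.

```python
from collections import namedtuple

# Simplified LAB entry used by this sketch (not the real hardware layout).
LABEntry = namedtuple("LABEntry",
                      "pc op_type src1_id src1_val src2_id src2_val "
                      "dst_id dst_val mem_addr")

def verify(entry, regs, mem):
    """Validate one speculated instruction against the non-speculative state.
    On a match, commit its destination value; on a mismatch, signal recovery."""
    # Source values must match the current non-speculative register state.
    for rid, val in ((entry.src1_id, entry.src1_val),
                     (entry.src2_id, entry.src2_val)):
        if rid is not None and regs.get(rid) != val:
            return False                        # mismatch: trace misspeculation
    if entry.op_type == "store":
        mem[entry.mem_addr] = entry.dst_val     # stores update memory
    else:
        regs[entry.dst_id] = entry.dst_val      # the rest update registers
    return True

regs = {1: 10, 2: 20}
mem = {}
ok = verify(LABEntry(0x400100, "alu", 1, 10, 2, 20, 3, 30, None), regs, mem)
bad = verify(LABEntry(0x400104, "alu", 3, 99, None, None, 4, 0, None), regs, mem)
```

A `False` result corresponds to miss trace speculation detection, triggering the recovery actions of the thread-synchronization mechanism.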
Thread Synchronization
Handles trace mispredictions
Recovery actions involved:
– instruction execution is stopped
– ST structures are emptied (IW, LSQ, ROB, LAB)
– the speculative cache and the ST register file are invalidated
Two types of synchronization:
– Total (occurs when the NST is not executing instructions): penalty to refill the pipeline
– Partial (occurs when the NST is executing instructions): no penalty; the NST takes the role of the ST
Maintains both speculative and non-speculative memory state
The traditional memory subsystem is supported
A small additional first-level cache is added to maintain the speculative memory state
Rules:
1. An ST store updates values in the L1SDC only
2. An ST load gets values from the L1SDC; on a miss, from the NS caches
3. An NST store updates values and allocates space in the NS caches
4. An NST load gets values and allocates space in the NS caches
5. A line replaced in the L1NSDC is copied back to the L2NSDC
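The rules above can be sketched with dict-based caches standing in for the L1SDC and the L1NSDC/L2NSDC. Cache sizes and line replacement are not modeled, so rule 5 (copy-back on replacement) is omitted; this is a behavioral sketch under those assumptions, not the hardware.

```python
class SpeculativeMemory:
    """Behavioral model of the TSMA memory rules (no sizes, no replacement)."""
    def __init__(self):
        self.l1sdc = {}    # speculative data cache (ST state)
        self.l1nsdc = {}   # non-speculative L1 data cache
        self.l2nsdc = {}   # non-speculative L2 data cache

    def st_store(self, addr, val):
        # Rule 1: an ST store updates the L1SDC only.
        self.l1sdc[addr] = val

    def st_load(self, addr):
        # Rule 2: an ST load checks the L1SDC first, then the NS caches.
        if addr in self.l1sdc:
            return self.l1sdc[addr]
        return self.nst_load(addr)

    def nst_store(self, addr, val):
        # Rule 3: an NST store updates and allocates in the NS caches.
        self.l1nsdc[addr] = val
        self.l2nsdc[addr] = val

    def nst_load(self, addr):
        # Rule 4: an NST load gets values and allocates in the NS caches.
        val = self.l1nsdc.get(addr, self.l2nsdc.get(addr, 0))
        self.l1nsdc[addr] = val
        return val

    def invalidate_speculative(self):
        # On a trace misspeculation, the speculative state is discarded.
        self.l1sdc.clear()

m = SpeculativeMemory()
m.nst_store(0x10, 5)       # non-speculative value
m.st_store(0x10, 9)        # speculative value shadows it for the ST only
```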
Memory Subsystem
(Diagram: L1SDC speculative data cache alongside the L1NSDC/L2NSDC non-speculative data caches.)
Register File
Slightly modified to permit prompt execution
The register map table contains, for each entry:
– Committed value
– ROB tag
– Counter
The counter field is maintained as follows:
– a new ST instruction increments its destination register's counter
– the counter is decremented when the ST instruction commits
– after a trace speculation, counters are no longer incremented, but each is decremented until it reaches zero
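The counter maintenance can be sketched as follows; the map-table layout is an illustrative assumption that mirrors the three fields listed above.

```python
class MapEntry:
    """One register map table entry: committed value, ROB tag, counter."""
    def __init__(self):
        self.committed_value = 0
        self.rob_tag = None
        self.counter = 0   # number of in-flight ST writers of this register

# One map entry per architected register (32 integer registers assumed here).
map_table = {r: MapEntry() for r in range(32)}

def st_rename_dest(reg):
    # A new ST instruction increments its destination register's counter.
    map_table[reg].counter += 1

def st_commit_dest(reg):
    # The counter is decremented when the ST instruction commits. After a
    # trace speculation, counters are only decremented, down to zero.
    if map_table[reg].counter > 0:
        map_table[reg].counter -= 1

st_rename_dest(3)
st_rename_dest(3)
st_commit_dest(3)
```

A zero counter tells the hardware that no in-flight ST instruction still targets the register, so the committed value is safe to use.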
Working Example
1. ST begins execution
2. Live-output actualization & trace speculation
3. NST begins execution
4. VE validates instructions
5. NST executes the speculated trace
6. NST executes some additional instructions
7. VE begins verification
8. VE finishes verification
9. Live-output actualization & trace speculation
10. NST execution
(Timeline diagram: NST, ST, and VE activity, showing instruction execution, not-executed intervals, and live-output validation.)
Experimental Framework
Simulator: Alpha version of the SimpleScalar toolset
Benchmarks: SPEC95
Maximum optimization level: DEC C & F77 compilers with -non_shared -O5
Statistics: collected for 125 million instructions, skipping initializations
Instruction fetch: 4 instructions per cycle
Branch predictor: 2048-entry bimodal predictor
Instruction issue/commit: out-of-order issue, 4 instructions commit per cycle, 64-entry reorder buffer; loads execute once preceding stores are known; store-load forwarding
Architected registers: 32 integer and 32 FP
Functional units: 4 integer ALUs, 2 load/store units, 4 FP adders, 1 integer mult/div, 1 FP mult/div
FU latency/repeat time: integer ALU 1/1, load/store 1/1, integer mult 3/1, integer div 20/19, FP adder 2/1, FP mult 4/1, FP div 12/12
Data cache: 16 KB, 2-way set associative, 32-byte block, 6-cycle miss latency
Instruction cache: 16 KB, direct mapped, 32-byte line, 6-cycle miss latency
Second-level cache: shared instruction & data cache, 256 KB, 4-way set associative, 32-byte block, 100-cycle miss latency
Base Microarchitecture
Speculation data cache: 1 KB, direct-mapped, 8-byte block
Verification engine: up to 8 instructions verified per cycle; memory instructions block verification on an L1 miss; the average number of instructions verified to find an error is 8
Trace speculation engine: history table with 64 entries, 2-way set associative, 9 instances per entry
Look Ahead Buffer: 128 entries
TSMA Additional Structures
Performance Evaluation
Main objective: show that trace misspeculations cause minor penalties
Traces are built following a simple rule: from backward branch to backward branch, with minimum and maximum sizes of 8 and 64 instructions respectively
A simple trace predictor is evaluated: stride + context value (history of 9)
Results provided:
– percentage of misspeculations
– percentage of predicted instructions
– speedup
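For context, the stride half of a stride + context value predictor can be sketched as below. The context component and the 9-deep history evaluated in the paper are omitted for brevity; the table keying and structure are illustrative assumptions.

```python
class StridePredictor:
    """Last-value + stride predictor: predicts last_value + last_stride."""
    def __init__(self):
        self.last = {}     # last observed value, keyed by (pc, register)
        self.stride = {}   # last observed stride, keyed the same way

    def predict(self, key):
        # No prediction until at least one value has been seen.
        if key not in self.last:
            return None
        return self.last[key] + self.stride.get(key, 0)

    def update(self, key, value):
        # Learn the stride from consecutive observed values.
        if key in self.last:
            self.stride[key] = value - self.last[key]
        self.last[key] = value

p = StridePredictor()
for v in (10, 20, 30):
    p.update("r3", v)
```

After observing 10, 20, 30 the learned stride is 10, so the next prediction for `r3` is 40.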
Misspeculations
(Figure: percentage of trace misspeculations for Applu, Compress, Gcc, Go, Ijpeg, Li, M88ksim, Mgrid, Perl, Turb3d, Vortex, and A_Mean; y-axis 0 to 100%.)
Predicted Instructions
(Figure: percentage of predicted instructions for Applu, Compress, Gcc, Go, Ijpeg, Li, M88ksim, Mgrid, Perl, Turb3d, Vortex, and A_Mean; y-axis 0 to 50%.)
Speedup
(Figure: speedup for Applu, Compress, Gcc, Go, Ijpeg, Li, M88ksim, Mgrid, Perl, Turb3d, Vortex, and A_Mean; y-axis 1.00 to 1.35.)
Conclusions
TSMA is designed to exploit trace-level speculation
Special emphasis on minimizing misspeculation penalties
Results show:
– the architecture is tolerant to misspeculations
– a speedup of 16% with a predictor that misses 70% of the time
Future Work
Aggressive trace-level predictors: bigger traces, better value predictors
Generalization to multiple threads: cascaded execution
Mixing prediction & execution: speculated traces do not need to be fully speculated