
Page 1:

Shader Performance Analysis on a Modern GPU Architecture

Victor Moya, Carlos González, Jordi Roca, Agustín Fernández
Department of Computer Architecture, UPC

Roger Espasa
Intel DEG, Barcelona

Page 2:

Introduction

• Shaders in GPUs evolving towards general programming
  - Branches, generic loads, scatter
• New types of shaders: geometry in DX10
• Current specialized shaders
  - Area hungry
  - Unbalancing leads to inefficiencies
• This paper: unify all shaders
  - ~8% higher performance with less area and resources

Page 3:

Outline

• Attila – our GPU architecture
• Attila-Classic: Non-unified shaders
• Attila-Unified: Unified Shaders
• Simulation Framework
• Results


Page 5:

ATTILA

• Our implementation of current GPUs
  - Inspired by both NVIDIA and ATI
  - Not an exact match for either pipeline
• Lack of detailed microarchitecture information
  - Educated guessing on our side
• Implemented features
  - 2D homogeneous recursive rasterization
  - Tiled rasterization
  - Hierarchical Z
  - Texture compression
  - Anisotropic filtering
  - Depth compression, fast z/stencil and color clear


Page 7:

Attila Classic

[Figure: block diagram of the Attila Classic pipeline with specialized shaders. Blocks shown: Vertex Fetch, four Vertex Shaders, Primitive Assembly, Clipping, Triangle Setup, Rasterization, Hierarchical Z, four Fragment Shaders, four ROPs, four Memory Controllers.]

Page 8:

Specialized Shader Issues

• Unbalancing
  - In fragment-shading-limited scenarios (typical), up to 30% of the processing power remains idle (for a GPU with 8 vertex and 4 fragment shaders)
  - In vertex-shading-limited scenarios, up to 70% of the processing power remains idle
• Dedicated area
  - 4 unused vertex shaders have the same processing power as 1 fragment shader
  - 4 vertex shaders require 66% of the area of a fragment shader
• Different designs
  - Increase the complexity of the microarchitecture
  - Increase development and verification time


Page 10:

Attila Unified

[Figure: block diagram of the Attila Unified pipeline. Blocks shown: Vertex Fetch, Primitive Assembly, Clipping, Triangle Setup, Rasterization, Hierarchical Z, Scheduler, Distributor, a Unified Shader Pool of four Shaders, four ROPs, four Memory Controllers.]

Page 11:

Unified Shader Architecture

Benefits
• Unified programming model
  - DX10/SM4 and OpenGL/GLSlang are already pushing for it
  - The same features for all the program targets: texturing, branching, outputs
• Not just vertex and fragment programs
  - DX10 => geometry shader
  - General Purpose GPU or Stream Processor
• Workload balance
  - Shading resources allocated as required at any point of the rendering

Page 12:

Unified Shader Architecture

Costs
• Scheduler
  - Select which kind of workload must be processed next
  - Partly implemented with multithreading in the fragment shader to hide texture access latency
• Larger instruction memory and constant bank
• Rerouting required
  - All the paths cross the shader pool
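
The scheduler's job of choosing which kind of workload runs next can be illustrated with a toy model. The sketch below is not the Attila scheduler: the class name, the two queues and the "fragments first, vertices fill idle shaders" policy are all assumptions made for illustration, since the slides do not describe the actual policy.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class UnifiedScheduler:
    """Toy scheduler for a pool of unified shaders (hypothetical, illustrative only)."""
    num_shaders: int
    vertex_queue: deque = field(default_factory=deque)
    fragment_queue: deque = field(default_factory=deque)

    def submit(self, kind, item):
        """Queue one work item; kind is "vertex" or "fragment"."""
        queue = self.vertex_queue if kind == "vertex" else self.fragment_queue
        queue.append(item)

    def schedule_cycle(self):
        """Hand at most one work item to each shader for this cycle.

        Assumed policy: pending fragments are served first, and vertices
        fill any shader that would otherwise sit idle.
        """
        assignments = []
        for shader in range(self.num_shaders):
            if self.fragment_queue:
                assignments.append((shader, "fragment", self.fragment_queue.popleft()))
            elif self.vertex_queue:
                assignments.append((shader, "vertex", self.vertex_queue.popleft()))
            else:
                break  # no work left, remaining shaders idle
        return assignments


sched = UnifiedScheduler(num_shaders=4)
for i in range(2):
    sched.submit("fragment", f"f{i}")
for i in range(5):
    sched.submit("vertex", f"v{i}")
assignments = sched.schedule_cycle()
```

With two fragments and five vertices pending, all four shaders receive work in the same cycle, which is the workload-balance benefit that the specialized design cannot offer.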


Page 14:

ATTILA Framework

• OpenGL Interceptor tool
• OpenGL library for Attila GPU
• Driver for our Attila GPU
• Attila GPU simulator
• Signal Visualizer Tool

Page 15:

[Figure: ATTILA framework flow diagram in four phases: Collect, Verify, Simulate, Analyze. Blocks shown: OpenGL Application, GLInterceptor, Vendor OpenGL Driver, ATI R520/NVidia G70, Trace, Framebuffer, GLPlayer, ATTILA OpenGL Driver, ATTILA Simulator, Statistics, Signal Traffic, Signal Visualizer. Two "CHECK!" marks appear in the diagram.]

Page 16:

[Figure: ATTILA framework flow diagram (Collect, Verify, Simulate, Analyze), repeated from page 15.]

GLInterceptor

• Captures a trace of OpenGL API calls from a real game

Page 17:

[Figure: ATTILA framework flow diagram (Collect, Verify, Simulate, Analyze), repeated from page 15.]

GLPlayer

• Reproduces the captured trace
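
The capture-and-replay idea behind GLInterceptor and GLPlayer can be sketched in a few lines. This is only an illustration of the pattern, not the real tools: `Interceptor`, `replay` and `FakeDriver` are hypothetical names, the "API" has two made-up calls, and the "framebuffer" is just a list of the calls a driver has seen.

```python
import json


class Interceptor:
    """Forwards every method call to `backend` and logs it to a trace."""

    def __init__(self, backend):
        self._backend = backend
        self.trace = []

    def __getattr__(self, name):
        # Only reached for names not found on the Interceptor itself,
        # i.e. the API calls that belong to the wrapped backend.
        target = getattr(self._backend, name)

        def wrapper(*args):
            self.trace.append({"call": name, "args": list(args)})
            return target(*args)

        return wrapper


def replay(trace, backend):
    """Re-issue every recorded call, in order, against another backend."""
    for record in trace:
        getattr(backend, record["call"])(*record["args"])


class FakeDriver:
    """Hypothetical stand-in for a driver plus GPU."""

    def __init__(self):
        self.framebuffer = []

    def clear(self, r, g, b):
        self.framebuffer.append(("clear", r, g, b))

    def draw_triangles(self, first, count):
        self.framebuffer.append(("draw", first, count))


# Collect: the application talks to the vendor driver through the interceptor.
vendor = FakeDriver()
app = Interceptor(vendor)
app.clear(0, 0, 0)
app.draw_triangles(0, 3)

# Store the trace as text, then replay it on a second driver.
stored = json.dumps(app.trace)
simulated = FakeDriver()
replay(json.loads(stored), simulated)

# CHECK: both runs should produce the same framebuffer.
same_output = simulated.framebuffer == vendor.framebuffer
```

Comparing the two framebuffers mirrors the "CHECK!" steps in the diagram: the replayed trace is trusted only if it reproduces the output of the original run.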

Page 18:

[Figure: ATTILA framework flow diagram (Collect, Verify, Simulate, Analyze), repeated from page 15.]

OpenGL Library
- Transforms Fixed Function into Shader code
- 200 API Calls supported
- ARB Vertex and Fragment extensions
- Alpha and Fog emulated via Shader code

Driver
- Low level access
- Attila memory management

Page 19:

[Figure: ATTILA framework flow diagram (Collect, Verify, Simulate, Analyze), repeated from page 15.]

ATTILA Simulator
- Detailed cycle-by-cycle simulation of all pipeline stages
- 20 boxes, modeling a 100-deep pipeline
- Execute@Execute: functionality embedded at each pipeline stage
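
A cycle-by-cycle simulator built from boxes can be pictured as stages that are clocked once per cycle and exchange data over wires with a fixed latency. The sketch below is a minimal illustration under that assumption; the `Signal`, `Box` and `run` names and their interfaces are invented here and are not taken from the Attila source. Each box performs its real computation when it is clocked, which is the sense in which the functionality is embedded in the pipeline stage.

```python
from collections import deque


class Signal:
    """Fixed-latency wire: data written at cycle t becomes readable at t + latency."""

    def __init__(self, latency):
        self.latency = latency
        self._in_flight = deque()  # (arrival_cycle, data), in write order

    def write(self, cycle, data):
        self._in_flight.append((cycle + self.latency, data))

    def read(self, cycle):
        if self._in_flight and self._in_flight[0][0] <= cycle:
            return self._in_flight.popleft()[1]
        return None


class Box:
    """A pipeline stage: on each clock it reads its input, computes, and forwards."""

    def __init__(self, func, inp, out):
        self.func, self.inp, self.out = func, inp, out

    def clock(self, cycle):
        data = self.inp.read(cycle)
        if data is not None:
            self.out.write(cycle, self.func(data))


def run(boxes, output, cycles):
    """Clock every box once per cycle and collect (cycle, data) pairs from `output`."""
    results = []
    for cycle in range(cycles):
        for box in boxes:
            box.clock(cycle)
        data = output.read(cycle)
        if data is not None:
            results.append((cycle, data))
    return results


# Two boxes joined by signals with latencies 1, 2 and 1.
s_in, s_mid, s_out = Signal(1), Signal(2), Signal(1)
boxes = [Box(lambda x: x + 1, s_in, s_mid), Box(lambda x: x * 2, s_mid, s_out)]
s_in.write(0, 3)
results = run(boxes, s_out, cycles=6)  # [(4, 8)]: value 8 arrives at cycle 1 + 2 + 1
```

Because every signal has a latency of at least one cycle, the order in which boxes are clocked within a cycle does not change the result.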

Page 20:

Find the differences

[Figure: the same frame rendered by an NVIDIA GeForce FX 5900XT and by Attila.]


Page 22:

Benchmark

• Unreal Tournament 2004
  - Fixed function OpenGL API
  - Vertex and fragment shaders generated by our library
• 1024x768 resolution
• 8x Anisotropic Filtering
• 160 of 450 frames simulated
• 40 frames ~ 1 day of simulation
  - On a Xeon P4 @ 2.0 GHz

Page 23:

Baseline Configuration

• Four Vertex Shaders (only for Attila-Classic)
• Fragment and Unified shader configuration:
  - 32 threads
  - 4 fragments/vertices per thread
  - 16 128-bit FP registers available for temporary storage per thread
  - n SIMD ALUs
  - 1 scalar ALU (optional)
• 1 Texture Unit per Shader Unit
  - 16 KB texture cache
  - Single-cycle bilinear and two-cycle trilinear
  - AF up to 16x
• Geometry and Rasterization pipelines limited to 1 vertex and 1 triangle per cycle
• Two ROPs: 8 z and 8 color values written per cycle
• Four 64-bit DDR buses: peak bandwidth 64 bytes/cycle
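
The peak bandwidth figure follows from the bus description: four buses, 8 bytes wide each, transferring twice per cycle because they are DDR. A small check of that arithmetic, with a helper name chosen here for illustration:

```python
def peak_bytes_per_cycle(num_buses, bus_width_bits, transfers_per_cycle=2):
    """Peak memory bandwidth in bytes per cycle.

    transfers_per_cycle defaults to 2 because a DDR bus moves data on
    both clock edges.
    """
    return num_buses * (bus_width_bits // 8) * transfers_per_cycle


baseline_bandwidth = peak_bytes_per_cycle(num_buses=4, bus_width_bits=64)  # 4 * 8 * 2 = 64
```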

Page 24:

"Classic" Performance

[Figure: relative performance (y axis, 1 to 3) of Attila Classic for the 1-way, 1-way + scalar, 2-way and 4-way configurations (x axis). Series: 2sh, 4sh, 6sh, 8sh. Annotations on the chart: ~75%, ~45%, ~40%, 7%, 8%.]

• 8% improvement for 2-way
• Near linear improvement for 4 shaders
• Sublinear improvement for 6 and 8 shaders
  - Limited by memory bandwidth and latency

Page 25:

Frame 330 – Detailed Zoom

[Figure: vertex shader and fragment shader workload for 4 vertex shader units and 2 fragment shader units. Utilization (y axis, 0 to 1) against time (x axis, in steps of 10K cycles). Series: Vertex Shader, Fragment Shader. One region is marked "Vertex shading limited".]

Page 26:

Unified Shader Performance

[Figure: relative performance (y axis, 1 to 3) for the 1-way, 1-way + scalar, 2-way and 4-way configurations (x axis). Series: 2sh, uni2sh, 4sh, uni4sh, 6sh, uni6sh, 8sh, uni8sh. Annotations on the chart: Fragment shading limited, Vertex fetch limited, Geometry pipeline limited.]

• Unified improvement ranges from 1% (2 shaders) to 8% (eight 1-way shaders)

Page 27:

Area Estimation

                          ATI R400    ATI RV400
Transistors (millions)    160         120
Vertex Shaders            6           4
Fragment Shaders          4           2

Hardware Element          Estimated Area (Millions of Transistors)
Vertex Shader             2.5
Fragment Shader           15
Additional SIMD ALU       +15%
Additional scalar ALU     +5%

160 - 120 = 40 = 2 vertex shaders * 2.5 + 2 fragment shaders * 15 + 5 (other)
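
The estimate can be turned into a small area model to check the arithmetic. The per-unit costs come from the table. The function names are hypothetical, and the slide does not say how the percentages for additional ALUs are applied, so the code assumes they are relative to the base fragment shader and add linearly.

```python
VERTEX_SHADER_MT = 2.5     # millions of transistors per vertex shader (from the table)
FRAGMENT_SHADER_MT = 15.0  # millions of transistors per fragment shader (from the table)
SIMD_ALU_EXTRA = 0.15      # +15% per additional SIMD ALU
SCALAR_ALU_EXTRA = 0.05    # +5% for the additional scalar ALU


def fragment_shader_area(extra_simd_alus=0, scalar_alu=False):
    """Area of one fragment shader in millions of transistors.

    Assumption: the percentages apply to the base fragment shader area
    and add linearly.
    """
    factor = 1.0 + SIMD_ALU_EXTRA * extra_simd_alus
    if scalar_alu:
        factor += SCALAR_ALU_EXTRA
    return FRAGMENT_SHADER_MT * factor


def classic_shader_area(vertex_shaders, fragment_shaders, **fragment_options):
    """Total shader area of a non-unified configuration."""
    return (vertex_shaders * VERTEX_SHADER_MT
            + fragment_shaders * fragment_shader_area(**fragment_options))


# Difference between the two chips in the first table: 6+4 shaders against 4+2.
shader_delta = classic_shader_area(6, 4) - classic_shader_area(4, 2)  # 75.0 - 40.0
other_logic = (160 - 120) - shader_delta                              # the "5 (other)" term
```

The shader terms account for 35 of the 40 million transistors that separate the two chips, which leaves the 5 million labelled "other" on the slide.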

Page 28:

Shader Scaling vs Transistors

[Figure: frames per second (y axis, 50 to 170 fps) against area (x axis, 30 to 180 MTransistors). Series: 2-way, uni 2-way, 1-way, uni 1-way, and a linear reference. Points are labelled 2sh, 4sh, 6sh and 8sh.]

• Linear for 4 shader units, sublinear for more than 4 shader units
• Up to 30% more efficient per area for the unified architecture (two 1-way shaders)

Page 29:

Conclusion

• Attila Unified architecture has better performance than Attila Classic with less hardware
  - Up to 8% better performance
  - 8% to 25% less area required
  - 10% to 30% better performance per area
• Up to 8% better performance for 2-way shader units
• 160% better performance from 2 to 8 fragment or unified shader units
  - Memory bandwidth limited beyond 4 shaders
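
Performance per area combines the other two figures: a design that is faster by a fraction p while using a fraction a less area improves performance per area by (1 + p) / (1 - a) - 1. The sketch below only illustrates that relation. The slides do not state which performance gain goes with which area saving, so the input pairs are examples and not the configurations behind the reported 10% to 30% range.

```python
def perf_per_area_gain(perf_gain, area_saving):
    """Relative improvement in performance per area.

    perf_gain:   fractional speedup, e.g. 0.08 for 8% faster
    area_saving: fractional area reduction, e.g. 0.25 for 25% less area
    """
    return (1.0 + perf_gain) / (1.0 - area_saving) - 1.0


# Illustrative pairings only (the slides do not give the per-configuration pairs).
example_a = perf_per_area_gain(perf_gain=0.08, area_saving=0.08)
example_b = perf_per_area_gain(perf_gain=0.0, area_saving=0.25)
```

The second example shows that an area saving alone, with no speedup at all, already yields a sizeable gain per area, which is consistent with the slides reporting a larger range for performance per area than for raw performance.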

Page 30:

Questions

Page 31:

Performance of Attila Unified vs Classic Attila

[Figure: relative performance of Attila Unified over Attila Classic (y axis, 1 to 1.09) for uni2sh, uni4sh, uni6sh and uni8sh (x axis). Series: 1-way, 1-way + scalar, 2-way, 4-way.]