
Transcript of 08 Dkawamo2 JPEG Presentation

  • 7/31/2019 08 Dkawamo2 JPEG Presentation

    1/22

    CUDA JPEG Essentials

    Darek Kawamoto

  • Slide 2/22

    Introduction

    Project Origin

    Inverse Discrete Cosine Transform

    Kernel Summary

    Performance

    Parallel Huffman Decode for JPEG

    Design Approach

    Design Problems, Solutions

    Implementation Remarks

    Conclusion

  • Slide 3/22

    Project Origin

    Computer Animation for Scientific Visualization

    Stuart Levy, UIUC / NCSA

    Goal: Decode big (1920x1080) JPEG images fast (~30 fps)

    GPU cheaper than specialized hardware

    CUDA and Two JPEG Bottlenecks:

    Inverse Discrete Cosine Transform (IDCT)

    Straightforward, similar to class machine problems

    Huffman Decode Stage

    Tricky parallelization of a serial process

  • Slide 4/22

    Inverse Discrete Cosine Transform

    2-D IDCT:

    $$p_{xy} = \frac{1}{4} \sum_{i=0}^{7} \sum_{j=0}^{7} C_i C_j G_{ij} \cos\frac{(2x+1)i\pi}{16} \cos\frac{(2y+1)j\pi}{16}$$

    1-D IDCT:

    $$p_x = \frac{1}{2} \sum_{i=0}^{7} C_i G_i \cos\frac{(2x+1)i\pi}{16}$$

    where $C_f = \frac{1}{\sqrt{2}}$ when $f = 0$, and $C_f = 1$ otherwise

    2-D is equivalent to 1-D applied in each direction

    Kernel uses 1-D transforms

  • Slide 5/22

    IDCT Kernel

    Thread Parallelism
    Each thread corresponds to an element of the matrix
    Threads compute IDCT across columns, then rows

    Memory Access Patterns
    Shared memory: broadcast, or no bank conflicts
    Global memory: buffered, coalesced

    Other Optimizations
    Careful use of 16 KB shared memory: 6 blocks per SMP
    Unrolled 5x: each iteration computes five 2-D IDCTs

  • Slide 6/22

    IDCT Performance -- How...?

    How to benchmark?
    libJPEG executes processes serially
    GPU executes the IDCT process wholesale

    How precise?
    short implementations do almost as well as float
    double precision has no advantages

    How much work?
    GPU shines with > 64,000 blocks

    JPEG specific: CPU can short circuit vectors of zeros

    Let CPU short circuit ~50% of columns in the first IDCT
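The zero short circuit rests on a simple observation: if every AC coefficient in a column is zero, the 1-D IDCT output is just the scaled DC value, so the transform can be skipped. A minimal check (hypothetical name `column_is_dc_only`):

```c
/* Returns 1 if all AC terms of an 8-entry coefficient column are zero,
   meaning the 1-D IDCT output is constant and the transform can be
   skipped; returns 0 otherwise. */
int column_is_dc_only(const double *col, int stride)
{
    for (int i = 1; i < 8; i++)
        if (col[i * stride] != 0.0)
            return 0;
    return 1;
}
```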

  • Slide 7/22

    IDCT Performance -- Cost

    IDCT Implementations:

    (float) Naïve 1-D

    64 Multiplies and 64 Adds per 1-D transform

    (short) Chen-Wang 11 Multiplies, 29 Adds per 1-D transform

    (float) Arai, Agui, and Nakajima (AA&N)

    5 Multiplies, 29 Adds per 1-D transform

    Other multiplies folded into de-quantization tables

  • Slide 8/22

    IDCT Performance -- Small

    Approx. Execution times for 67,200 blocks:

    (float) Naïve 1-D GPU 4.69 ms

    (float) Naïve 1-D CPU Serial 333 ms (71x)

    (float) Naïve 1-D CPU Wholesale 100 ms (21x)

    (short) Chen-Wang Serial 30 ms (6.4x)

    (float) AA&N Wholesale 25 ms (5.3x)

    (float) AA&N Serial 268 ms (57x)

    GPU: ~29 GFLOPS

  • Slide 9/22

    IDCT Performance -- Big

    Approx. Execution times for 245,760 blocks:

    (float) Naïve 1-D GPU 16.94 ms

    (float) Naïve 1-D CPU Serial 1250 ms (73x)

    (float) Naïve 1-D CPU Wholesale 375 ms (22x)

    (short) Chen-Wang Serial 113 ms (6.7x)

    (float) AA&N Wholesale 91 ms (5.4x)

    (float) AA&N Serial 1000 ms (59x)

    GPU: ~30 GFLOPS
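The quoted GFLOPS figures can be sanity-checked from the naïve operation counts on slide 7: 64 multiplies + 64 adds per 1-D transform, and 16 1-D transforms (8 columns + 8 rows) per 2-D IDCT.

```c
/* Flops delivered by the naive kernel: 16 1-D transforms per 8x8
   block, 128 floating-point ops (64 mul + 64 add) each. */
double naive_gflops(long blocks, double ms)
{
    double flops = (double)blocks * 16.0 * 128.0;
    return flops / (ms * 1e-3) / 1e9;
}
```

With the slides' numbers, `naive_gflops(67200, 4.69)` is about 29.3 and `naive_gflops(245760, 16.94)` is about 29.7, matching the quoted ~29 and ~30 GFLOPS.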

  • Slide 10/22

    IDCT Performance Conclusion

    Amount
    Wholesale transforms work much better than retail
    67,200 IDCT blocks perform almost as well as 245,760

    Speed
    30 fps means each frame needs to be ready in 33 ms
    How much time to perform the other JPEG functions?
    With 67,200 blocks, we have 28.6 ms left
    With 245,760 blocks, we have 16.4 ms left

    Conclusion
    Could not previously hope to process in < 33 ms
    Application now depends on the speedup of other kernels

  • Slide 11/22

    Parallel Huffman Decode for JPEG

    Huffman Compression
    Prefix-free, variable-length code
    Serial in nature: decode each symbol in sequential order

    Parallel Decoding Challenge
    Impossible to determine where symbols start and end without decoding all previous symbols

    Design Approach
    Start decoding in the middle of the stream at several places, combine results when synchronization occurs, and throw out all extra work
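A toy illustration (not the author's kernel) of why speculative mid-stream decoding can work at all: with a prefix-free code, a decoder started at the wrong bit offset tends to fall back onto the true symbol boundaries within a few symbols. The three-symbol code here (A = 0, B = 10, C = 11) is invented for the example, and the stream is a string of '0'/'1' characters:

```c
#include <string.h>

/* Decode one symbol of the toy code {A=0, B=10, C=11} at bit
   position pos; return the position just past it. */
static int decode_symbol(const char *bits, int pos, char *sym)
{
    if (bits[pos] == '0') { *sym = 'A'; return pos + 1; }
    *sym = (bits[pos + 1] == '0') ? 'B' : 'C';
    return pos + 2;
}

/* Record the symbol-boundary positions a decoder visits when started
   at `start`; a speculative decoder has synchronized once its
   boundaries coincide with the true decoder's. */
int trace_boundaries(const char *bits, int start, int *bounds, int max)
{
    int n = 0, pos = start, len = (int)strlen(bits);
    char sym;
    while (pos < len && n < max) {
        pos = decode_symbol(bits, pos, &sym);
        bounds[n++] = pos;
    }
    return n;
}
```

For the stream "1011010011" (B C A B A C, boundaries 2, 4, 5, 7, 8, 10), a decoder started one bit late lands back on a true boundary at position 2 and stays synchronized from there; real JPEG streams behave similarly after a few hundred bits.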

  • Slide 12/22

    Design Approach

    Spawn parallel work threads

  • Slide 13/22

    Parallel Huffman Decode for JPEG

    Design Approach
    Start decoding in the middle of the stream at several places, combine results when synchronization occurs, and throw out all extra work

    Problems
    Does the parallel speedup of successful synchronization offset the penalty of extra work?
    Yes, if we choose our work wisely! Do so by exploiting JPEG structure and probability

    Each decoder thread doesn't know how much data it will decode
    Allocate memory on device using atomic functions

  • Slide 14/22

    Choosing Work Wisely

    Exploit Block Coding
    Each block of coefficients encodes a DC coefficient and assorted AC coefficients

    Due to quantization and the coding scheme, it's likely a block will end with an End of Block (EOB) symbol

    If the EOB symbol is 4 or more bits and can't prefix itself, the probability of a random occurrence is 1/16

    In regions where we want to start a parallel decode thread, only start after possible EOB symbols

    Can use any symbol to attempt synchronization; EOB is arbitrary, but practical because the DC coefficient is coded differently

  • Slide 15/22

    New Approach

    Suppose EOB = 0101
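Continuing the slide's example (EOB = 0101, an assumed pattern for illustration), the candidate scan over a '0'/'1' character stream might look like:

```c
#include <string.h>

/* Mark every offset where the assumed 4-bit EOB pattern "0101"
   appears; parallel decoder threads start just after these
   candidates. Some will be false hits, which the synchronization
   step later discards. */
int find_eob_candidates(const char *bits, int *starts, int max)
{
    int n = 0, len = (int)strlen(bits);
    for (int i = 0; i + 4 <= len && n < max; i++)
        if (memcmp(bits + i, "0101", 4) == 0)
            starts[n++] = i + 4;  /* decoding begins after the EOB */
    return n;
}
```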

  • Slide 16/22

    EOB Overhead

    Overhead associated with finding EOB symbols
    Implemented a kernel to do so; < 1 ms
    Effectiveness depends on block length statistics

    If we guarantee a true EOB hit in each section of the stream we look at, then we guarantee synchronization with that section

    If we do not guarantee synchronization, some threads may have to decode multiple sections

    Research on these statistics is necessary to make appropriate design decisions that maximize the probability of EOB hits while minimizing the number of false hits

  • Slide 17/22

    Decode Synchronization

    Each decoder thread maintains information:
    Where it started (the bit it first looked at)
    Length and data of decoder output
    Where it is (the bit it currently looks at)

    Synchronization occurs when the current thread location matches another's start location

    Problem
    What happens to false EOB hits... do they synchronize?
    How does each decoder thread know how much data it will decode? How do we allocate memory for each thread?

  • Slide 18/22

    Synchronization

    Problem
    What happens to false EOB hits... do they synchronize?

    Answer
    In general, yes they do. After several hundred bits, they synchronize with the real stream and will end at the next parallel section
    Experiments in Klein and Wiseman, "Parallel Huffman Decoding with Applications to JPEG Files" (2003)
    Can use advanced logic to prevent false-hitting decoder threads from doing too much work

  • Slide 19/22

    Memory Allocation

    Problem
    How does each decoder thread know how much data it will decode? How do we allocate memory for each thread?

    Solution
    Store decoder output in chunks of global memory
    Use atomic functions to acquire locks on chunks
    Requires compute capability 1.1 (G92s)
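A minimal sketch of the chunked-allocation idea, using C11 `stdatomic` in place of CUDA's `atomicAdd` (the chunk size and names are illustrative):

```c
#include <stdatomic.h>

#define CHUNK_WORDS 256          /* arbitrary chunk size for the sketch */

static atomic_int next_chunk;    /* shared bump counter over chunk pool */

/* Each decoder thread calls this when its current output chunk fills
   up; the atomic fetch-add hands out every chunk index exactly once,
   no matter how many threads race on it. */
int acquire_chunk(void)
{
    return atomic_fetch_add(&next_chunk, 1);
}
```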

  • Slide 20/22

    Putting it All Together

    After all decoder threads have finished:
    We figure out which threads did meaningful work
    Chain the decoded data together to create the output (makes use of the decoder thread information)
    Clear out the scratch space (memory chunks)
    Throw away all of the extra work
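The stitching step can be sketched as a walk over the per-thread records kept during decoding; the record layout here is an assumption for illustration:

```c
/* Each decoder thread recorded where it started and where it stopped.
   Keep only the records that chain from the true start of the stream;
   everything else was wasted speculative work and is discarded. */
typedef struct { int start_bit; int end_bit; } decode_record;

int chain_records(const decode_record *rec, int n, int *order)
{
    int count = 0, pos = 0;          /* the real stream starts at bit 0 */
    for (int found = 1; found; ) {
        found = 0;
        for (int i = 0; i < n; i++) {
            /* the end_bit > pos check guards against zero-progress
               records looping forever */
            if (rec[i].start_bit == pos && rec[i].end_bit > pos) {
                order[count++] = i;  /* this thread's output is kept */
                pos = rec[i].end_bit;
                found = 1;
                break;
            }
        }
    }
    return count;
}
```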

  • Slide 21/22

  • Slide 22/22

    Conclusion

    IDCT Kernel speedup (5-59x) based on context
    Because of the serial nature of JPEG, applications often do not make use of wholesale transforms

    Parallel Huffman Decoding is Complex
    Is now the main bottleneck of JPEG decompression
    Lots of potential speedup to be had, but requires careful and precise research and development

    30 frames per second high-res JPEG Animation
    Possible and probable, with additional work