Blender Git Log

Revision 1db96fa by Lukas Stockner (soc-2016-cycles_denoising, temp-cycles-denoising)

August 21, 2016, 03:51 (GMT)

Cycles: Redesign CUDA kernels to increase denoising performance

This commit contains essentially a complete overhaul of the CUDA denoising kernels.

One of the main changes is splitting up the huge estimate_params kernel into multiple smaller ones:
- One kernel calculates the reduced feature space transform.
- One kernel estimates the feature bandwidths.
- One kernel estimates bias and variance for a given global bandwidth. This kernel is executed multiple times for different global bandwidths.
- One kernel calculates the optimal global bandwidth.

This improves UI responsiveness since the individual kernel launches are shorter.
Also, smaller kernels generally behave better on GPUs - from register allocation to warp divergence.
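
As a rough host-side illustration of the split described above, the launch order can be sketched as follows. The stage names, the Launch struct and the plan_launches function are hypothetical and chosen only for this sketch; the commit message does not name the actual kernels or the global bandwidths that are tried.

```cpp
#include <string>
#include <vector>

// Hypothetical record of one kernel launch (not a Cycles type).
struct Launch {
    std::string stage;       // illustrative stage name, not the real kernel name
    float global_bandwidth;  // only meaningful for the bias/variance stage
};

// Builds the launch order described in the commit message:
// transform, feature bandwidths, bias/variance once per candidate
// global bandwidth, then the optimal global bandwidth.
std::vector<Launch> plan_launches(const std::vector<float>& candidate_bandwidths) {
    std::vector<Launch> launches;
    launches.push_back({"reduced_feature_space_transform", 0.0f});
    launches.push_back({"feature_bandwidths", 0.0f});
    for (float bandwidth : candidate_bandwidths) {
        // Executed multiple times, once for each global bandwidth.
        launches.push_back({"bias_and_variance", bandwidth});
    }
    launches.push_back({"optimal_global_bandwidth", 0.0f});
    return launches;
}
```

With three candidate bandwidths this yields six short launches instead of one long one, which is what gives the UI a chance to stay responsive between them.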

The next major improvement concerns the transform - before this commit, transform loads from global memory were the main bottleneck.
First of all, it's now stored in a structure-of-arrays (SoA) layout instead of array-of-structures (AoS), which makes all transform loads coalesced.
Furthermore, the transform pointer is declared as "float const* __restrict__" instead of float*, which allows NVCC to cache the transform reads. Since only the first kernel writes the transforms, all later kernels can treat them as read-only, which increases speed again.
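
To illustrate why the layout change makes the loads coalesced, here is a minimal index-math sketch in plain C++. It assumes that neighbouring threads handle neighbouring pixels; the function names and the elems_per_pixel / num_pixels parameters are illustrative and do not come from the Cycles sources.

```cpp
#include <cassert>
#include <cstddef>

// AoS: all elements of one pixel's transform are stored contiguously.
inline std::size_t aos_index(std::size_t pixel, std::size_t element,
                             std::size_t elems_per_pixel) {
    assert(element < elems_per_pixel);
    return pixel * elems_per_pixel + element;
}

// SoA: the same element of every pixel's transform is stored contiguously.
inline std::size_t soa_index(std::size_t pixel, std::size_t element,
                             std::size_t num_pixels) {
    assert(pixel < num_pixels);
    return element * num_pixels + pixel;
}

// Distance (in floats) between the addresses that two neighbouring
// threads (pixel 0 and pixel 1) read when loading the same element.
inline std::size_t aos_stride(std::size_t elems_per_pixel) {
    return aos_index(1, 0, elems_per_pixel) - aos_index(0, 0, elems_per_pixel);
}

inline std::size_t soa_stride(std::size_t num_pixels) {
    return soa_index(1, 0, num_pixels) - soa_index(0, 0, num_pixels);
}
```

In the AoS case the stride equals the number of elements per transform, so a warp touches widely separated addresses; in the SoA case the stride is 1, so consecutive threads read consecutive floats.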

The third major change is that the feature vector, which is used in every per-pixel loop, is now stored in shared memory.
Since the feature vector is involved in a lot of operations, this improves performance again.
On the other hand, shared memory is rather limited on Kepler and older, so even the 11 floats per thread are already a lot.
With the default "16KB shared - 48KB L1 Cache" split on a GTX780, occupancy is only 12.5% - way too low.
With "48KB shared - 16KB L1 Cache", occupancy is back up to 50%, but of course there are more cache misses. In the end, though, the benefits of having the feature vector in shared memory make up for that.
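
The two occupancy figures can be reproduced with a short calculation. This is only a sketch under stated assumptions: a block size of 256 threads (not given in the commit message, but it matches both quoted numbers), shared memory as the only limiting resource, and the Kepler limit of 2048 resident threads per multiprocessor.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Kepler (compute capability 3.x) limit on resident threads per multiprocessor.
constexpr std::size_t kMaxThreadsPerSM = 2048;

// Assumption: 256 threads per block; the commit message does not state this.
constexpr std::size_t kThreadsPerBlock = 256;

// From the commit message: 11 floats of shared memory per thread.
constexpr std::size_t kFloatsPerThread = 11;

// Occupancy when shared memory is the only limiting resource
// (register usage and other limits are ignored in this sketch).
double shared_limited_occupancy(std::size_t shared_bytes_per_sm) {
    const std::size_t bytes_per_block =
        kThreadsPerBlock * kFloatsPerThread * sizeof(float);
    assert(bytes_per_block > 0);
    const std::size_t resident_blocks = shared_bytes_per_sm / bytes_per_block;
    const std::size_t resident_threads =
        std::min(resident_blocks * kThreadsPerBlock, kMaxThreadsPerSM);
    return static_cast<double>(resident_threads) / kMaxThreadsPerSM;
}
```

Under these assumptions each block needs 256 * 11 * 4 = 11264 bytes, so 16KB of shared memory fits one block (256 of 2048 threads, 12.5%) and 48KB fits four blocks (1024 of 2048 threads, 50%).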

I expect the performance boost to be even higher on Maxwell and Pascal, since these have much larger shared memory and L1.

Commit Details:

Full Hash: 1db96fa89c16f8d823f084659ddc99f726544a8f
Parent Commit: 25df3ca
Lines Changed: +683, -49

6 Modified Paths:

/intern/cycles/device/device_cuda.cpp (+63, -15)
/intern/cycles/kernel/kernels/cuda/kernel.cu (+54, -14)
/intern/cycles/kernel/kernel_filter.h (+427, -12)
/intern/cycles/kernel/kernel_filter_util.h (+32, -8)
/intern/cycles/kernel/kernel_types.h (+1, -0)
/intern/cycles/util/util_math_matrix.h (+106, -0)
