Blender Git Commit Log

Git Commits -> Revision fba2b77

August 21, 2016, 04:06 (GMT)
Cycles: Further improve CUDA denoising speed by redesigning the design_row

The previous algorithm was:
- Fetch buffer data into the feature vector which was in shared (faster) memory
- Use the feature vector to calculate the weight and the design_row, which was stored in local (slower) memory
- Update the Gramian matrix using the design_row

Now, the problem there is that the most expensive part in terms of memory accesses is the third step, which means that having the design_row in shared memory would be a great improvement.
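
To make the cost of the third step concrete, here is a minimal C++ sketch of a weighted Gramian update. It is not the Cycles code: the function names, the row-major layout and the use of std::vector are assumptions chosen for illustration. It only shows that each update reads the design_row many more times than it was written.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not the Cycles API): accumulates
// weight * row * row^T into the upper triangle of an n x n Gramian
// stored row-major in a flat array.
inline void gramian_add_row(std::vector<float> &gramian,
                            const std::vector<float> &design_row,
                            float weight)
{
    const std::size_t n = design_row.size();
    assert(gramian.size() == n * n);
    for (std::size_t r = 0; r < n; r++) {
        for (std::size_t c = r; c < n; c++) {
            // Each Gramian entry needs two design_row elements.
            gramian[r * n + c] += weight * design_row[r] * design_row[c];
        }
    }
}

// Design_row reads performed by one update as written above:
// two per upper-triangle entry, i.e. 2 * n * (n + 1) / 2.
inline std::size_t design_row_reads(std::size_t n)
{
    return n * (n + 1);
}
```

For a 23-element row this gives design_row_reads(23) == 552 reads per update, compared with 23 writes to build the row, which is why the memory the row lives in matters so much.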

However, shared memory is extremely limited - for good performance, the number of elements per thread should be odd (to avoid bank conflicts), but even going from the 11 floats that the feature vector needs to 13 already significantly hurts the occupancy.
Therefore, in order to make room for the design_row, it would be great to get rid of the feature vector.
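
The odd-count rule can be checked with a small host-side C++ model. The assumptions here are mine, not from the commit: 32 banks of 4-byte words, 32 threads per warp, and each thread storing its floats contiguously so that thread t starts at offset t * stride. The function name is hypothetical.

```cpp
#include <array>
#include <cassert>

// Assumed hardware parameters: 32 banks of 4-byte words, 32 threads per warp.
constexpr int NUM_BANKS = 32;
constexpr int WARP_SIZE = 32;

// Worst-case number of threads in one warp that hit the same bank when
// every thread reads float number `element` of its own block of
// `stride` contiguous floats.
inline int max_bank_conflict(int stride, int element = 0)
{
    assert(stride > 0);
    std::array<int, NUM_BANKS> hits{};
    int worst = 0;
    for (int t = 0; t < WARP_SIZE; t++) {
        const int bank = (t * stride + element) % NUM_BANKS;
        hits[bank]++;
        if (hits[bank] > worst) {
            worst = hits[bank];
        }
    }
    return worst;
}
```

Under this model, max_bank_conflict(11) and max_bank_conflict(13) both return 1 (conflict-free), while max_bank_conflict(12) returns 4, because an even stride shares a factor with the bank count.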

That's the first part of the commit: By changing the order in which the design_row is built, the first two steps can be merged so that the design_row is constructed directly from the buffer data instead of going through the feature vector.
This has a disadvantage - the old design_row construction had an early abort for zero weights, which were pretty common. With the new structure, that's not possible anymore. However, this matters less on GPUs because of divergence - for the early abort to save any time, all 32 threads in the warp had to abort anyway.
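
The divergence argument can be stated as a tiny C++ predicate. This is a host-side illustration with a hypothetical name, assuming a warp of 32 lanes that execute in lockstep: the warp only skips the row construction if every lane has a zero weight.

```cpp
#include <algorithm>
#include <array>

// Hypothetical model of a warp-wide early abort: lanes run in lockstep,
// so the warp pays for the row construction unless all 32 weights are zero.
inline bool warp_can_skip(const std::array<float, 32> &weights)
{
    return std::all_of(weights.begin(), weights.end(),
                       [](float w) { return w == 0.0f; });
}
```

A single non-zero weight among the 32 lanes makes warp_can_skip return false, so a per-thread early abort that fires often can still save very little at the warp level.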

Now the feature vector doesn't take up memory anymore, but the design_row is still too big - it has up to 23 elements, which is far too many.
It has a useful property, though - the first element is always one, and the last 11 elements are just the squares of the 11 elements that follow the leading one. So, storing 11 floats is enough to have all information, and the squaring can be performed when the design_row is used.
Therefore, the second part of the commit adds specialized functions that accept this reduced design_row and account for these missing elements.
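
As a rough sketch of what such a specialized function might look like, the following C++ code updates the Gramian from the reduced row alone. The names, the std::vector storage and the row-major layout are assumptions for illustration and are not taken from util_math_matrix.h; only the row layout (a one, then the stored values, then their squares) comes from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical accessor: element i of the full row
// (1, f_0 .. f_{k-1}, f_0^2 .. f_{k-1}^2), computed from the k stored floats.
inline float full_row_element(const std::vector<float> &reduced, std::size_t i)
{
    const std::size_t k = reduced.size();
    assert(i < 2 * k + 1);
    if (i == 0) {
        return 1.0f;            // implicit leading one
    }
    if (i <= k) {
        return reduced[i - 1];  // stored value
    }
    const float f = reduced[i - 1 - k];
    return f * f;               // square reconstructed on use
}

// Hypothetical Gramian update that never materializes the full row.
// Accumulates into the upper triangle of a row-major n x n matrix,
// where n = 2 * k + 1.
inline void gramian_add_reduced_row(std::vector<float> &gramian,
                                    const std::vector<float> &reduced,
                                    float weight)
{
    const std::size_t n = 2 * reduced.size() + 1;
    assert(gramian.size() == n * n);
    for (std::size_t r = 0; r < n; r++) {
        const float row_r = full_row_element(reduced, r);
        for (std::size_t c = r; c < n; c++) {
            gramian[r * n + c] += weight * row_r * full_row_element(reduced, c);
        }
    }
}
```

With 11 stored floats this addresses a 23 x 23 Gramian while keeping only 11 values per thread, trading a few extra multiplications for the smaller footprint.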

Commit Details:

Full Hash: fba2b77c2a12950802491c3112b3922f5805f98a
Parent Commit: 8dcf23b
Lines Changed: +126, -51

3 Modified Paths:

/intern/cycles/kernel/kernel_filter.h (+29, -36) (Diff)
/intern/cycles/kernel/kernel_filter_util.h (+27, -15) (Diff)
/intern/cycles/util/util_math_matrix.h (+70, -0) (Diff)