August 21, 2016, 04:06 (GMT) |
Cycles: Further improve CUDA denoising speed by redesigning the design_row

The previous algorithm was:
- Fetch buffer data into the feature vector, which was in shared (faster) memory
- Use the feature vector to calculate the weight and the design_row, which was stored in local (slower) memory
- Update the Gramian matrix using the design_row

Now, the problem there is that the most expensive part in terms of memory accesses is the third step, which means that having the design_row in shared memory would be a great improvement. However, shared memory is extremely limited - for good performance, the number of elements per thread should be odd (to avoid bank conflicts), but even going from the 11 floats that the feature vector needs to 13 already significantly hurts the occupancy. Therefore, in order to make room for the design_row, it would be great to get rid of the feature vector.

That's the first part of the commit: By changing the order in which the design_row is built, the first two steps can be merged so that the design_row is constructed directly from the buffer data instead of going through the feature vector. This has a disadvantage - the old design_row construction had an early-abort for zero weights, which were pretty common. With the new structure, that's not possible anymore. However, this is less of a problem on GPUs due to divergence - in order to save any time, all 32 threads in the warp had to abort anyway.

Now the feature vector doesn't take up memory anymore, but the design_row is still too big - it has up to 23 elements, which is far too many. It has a useful property, though - the first element is always one, and the last 11 elements are just the squares of the 11 that follow it. So, storing 11 floats is enough to have all information, and the squaring can be performed when the design_row is used. Therefore, the second part of the commit adds specialized functions that accept this reduced design_row and account for these missing elements. |
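To illustrate the reduced design_row described in the commit above, here is a minimal C++ sketch, not the actual Cycles code. It assumes the full row is laid out as [1, f_0..f_10, f_0^2..f_10^2] with all 11 features in use. The names design_row_element and gramian_add and the dense row-major Gramian storage are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kFeatures = 11;               // floats actually stored per thread
constexpr std::size_t kRowSize = 1 + 2 * kFeatures; // 23 elements in the full design_row

// Reconstruct element i of the full design_row from the reduced one.
// Assumed layout: index 0 is the constant one, 1..11 are the stored values,
// 12..22 are their squares, computed on the fly instead of being stored.
inline float design_row_element(const std::array<float, kFeatures>& reduced, std::size_t i)
{
    assert(i < kRowSize);
    if (i == 0) {
        return 1.0f;
    }
    if (i <= kFeatures) {
        return reduced[i - 1];
    }
    const float f = reduced[i - 1 - kFeatures];
    return f * f;
}

// Accumulate weight * row^T * row into the upper triangle of a dense,
// row-major Gramian (hypothetical storage; the matrix is symmetric).
inline void gramian_add(std::array<float, kRowSize * kRowSize>& gramian,
                        const std::array<float, kFeatures>& reduced,
                        float weight)
{
    for (std::size_t row = 0; row < kRowSize; ++row) {
        const float a = weight * design_row_element(reduced, row);
        for (std::size_t col = row; col < kRowSize; ++col) {
            gramian[row * kRowSize + col] += a * design_row_element(reduced, col);
        }
    }
}
```

The point of the sketch is that only the 11 stored floats need to live in fast memory. The leading one and the squares cost a few extra multiplications each time the row is used.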
August 21, 2016, 04:06 (GMT) |
Cycles: Fix various issues with the denoising debug passes |
August 21, 2016, 04:05 (GMT) |
Cycles: Fix undefined filter strength when using standalone denoising |
August 21, 2016, 04:05 (GMT) |
Cycles: Fix wrong sample variance variance calculation

The missing factor caused the NLM filtering of the buffer variance to essentially reduce to a simple box filter, which overblurred the buffer variance and therefore caused problems with sharp edges in the shadow buffer. |
August 21, 2016, 04:05 (GMT) |
Cycles: Fix wrong offset for feature matrix norm calculation |
August 21, 2016, 04:04 (GMT) |
Cycles: Add debugging option to CUDA for switching between large L1 cache or large shared memory |
August 21, 2016, 04:04 (GMT) |
Cycles: Write optional debug info for CUDA shadow prefiltering |
August 21, 2016, 04:04 (GMT) |
Cycles: Don't denoise the current tile if the user cancelled the render |
August 21, 2016, 04:04 (GMT) |
Cycles Denoising: Tweak shadow filtering |
August 21, 2016, 04:04 (GMT) |
Cycles: Revert to 6 bias-variance samples

The CUDA redesign commit removed the sample at h=2, but I found that this actually makes results worse. Therefore, it's now added back. |
August 21, 2016, 03:51 (GMT) |
Cycles: Redesign CUDA kernels to increase denoising performance

This commit contains essentially a complete overhaul of the CUDA denoising kernels. One of the main changes is splitting up the huge estimate_params kernel into multiple smaller ones:
- One kernel calculates the reduced feature space transform.
- One kernel estimates the feature bandwidths.
- One kernel estimates bias and variance for a given global bandwidth. This kernel is executed multiple times for different global bandwidths.
- One kernel calculates the optimal global bandwidth.

This improves UI responsiveness since the individual kernel launches are shorter. Also, smaller kernels are generally a good thing on GPUs - from register allocation to warp divergence.

The next major improvement concerns the transform - before this commit, transform loads from global memory were the main bottleneck. First of all, it's now stored in a SoA layout instead of AoS, which makes all transform loads coalesced. Furthermore, the transform pointer is declared as "float const* __restrict__" instead of float*, which allows NVCC to cache the transform reads. Since only the first kernel writes the transforms, this increases speed again.

The third major change is that the feature vector, which is used in every per-pixel loop, is now stored in shared memory. Since the feature vector is involved in a lot of operations, this improves performance again. On the other hand, shared memory is rather limited on Kepler and older, so even the 11 floats per thread are already a lot. With the default "16KB shared - 48KB L1 Cache" split on a GTX780, occupancy is only 12.5% - way too low. With "48KB shared - 16KB L1 Cache", occupancy is back up at 50%, but of course there are more cache misses - in the end, though, the benefits of having the feature vector in shared memory make up for that. I expect the performance boost to be even higher on Maxwell and Pascal, since these have much larger shared memory and L1. |
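The AoS to SoA change for the transform can be sketched with plain index arithmetic. This is an illustrative C++ sketch, not the actual kernel code: the helper names aos_index, soa_index and load_transform are hypothetical, and it assumes one transform per pixel with a fixed number of elements. The `__restrict__` spelling is accepted by NVCC as well as GCC and Clang (MSVC spells it `__restrict`).

```cpp
#include <cassert>
#include <cstddef>

// AoS: all elements of one pixel's transform are contiguous, so neighbouring
// pixels (threads) reading the same element are elements_per_pixel floats apart.
inline std::size_t aos_index(std::size_t pixel, std::size_t element,
                             std::size_t elements_per_pixel)
{
    assert(element < elements_per_pixel);
    return pixel * elements_per_pixel + element;
}

// SoA: element k of every pixel is contiguous, so neighbouring pixels reading
// the same element touch adjacent addresses, which is what allows coalescing.
inline std::size_t soa_index(std::size_t pixel, std::size_t element,
                             std::size_t num_pixels)
{
    assert(pixel < num_pixels);
    return element * num_pixels + pixel;
}

// Read-only access through a const, restrict-qualified pointer. The qualifiers
// promise the compiler that the data is not written through another alias
// while this function runs.
inline float load_transform(const float* __restrict__ transform,
                            std::size_t pixel, std::size_t element,
                            std::size_t num_pixels)
{
    return transform[soa_index(pixel, element, num_pixels)];
}
```

With the SoA layout, the 32 threads of a warp that read the same element for 32 consecutive pixels access 32 consecutive floats. With AoS, the same reads are spread across memory with a stride of one full transform.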
August 21, 2016, 03:51 (GMT) |
Cycles: Fix wrong stride in buffer accesses when the tile size plus overscan wasn't a multiple of 4 |
August 21, 2016, 03:50 (GMT) |
Cycles: Fix compilation with enabled filter debug output |
August 21, 2016, 03:38 (GMT) |
Merge remote-tracking branch 'origin/master' into soc-2016-cycles_denoising

This was an extremely hacky merge with a lot of rebasing and git tricks involved; I hope it works as it's supposed to. |
August 13, 2016, 03:11 (GMT) |
Cycles: Fix memory leak in the denoiser |
August 13, 2016, 02:58 (GMT) |
Cycles: Implement the multi-frame denoising kernel

This commit changes the denoising kernel to actually use the additional frames. The required changes are surprisingly small - one additional feature contains the frame to which the pixel belongs, and the per-pixel loop now iterates over frames first. |
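The loop order described in the commit above, frames first and then the pixels of the filter window, can be sketched as follows. This is a hedged C++ illustration only: PixelSample and gather_window are hypothetical names, the window is assumed to be square, and clipping against the image bounds is left out for brevity.

```cpp
#include <cassert>
#include <vector>

// One visited pixel; the frame index is what the commit describes as the
// additional feature (hypothetical struct, for illustration only).
struct PixelSample {
    int frame;
    int x;
    int y;
};

// Visit every pixel of a square window around (cx, cy) in every frame,
// iterating over frames in the outermost loop.
inline std::vector<PixelSample> gather_window(int num_frames, int cx, int cy, int half_window)
{
    assert(num_frames > 0 && half_window >= 0);
    std::vector<PixelSample> samples;
    for (int frame = 0; frame < num_frames; ++frame) {
        for (int y = cy - half_window; y <= cy + half_window; ++y) {
            for (int x = cx - half_window; x <= cx + half_window; ++x) {
                samples.push_back({frame, x, y});
            }
        }
    }
    return samples;
}
```

With a single frame this reduces to the ordinary per-pixel window loop, which is consistent with the commit's remark that the required changes were small.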
August 13, 2016, 02:06 (GMT) |
Cycles: Implement multi-frame denoising buffers

This commit changes the prefiltering code so that it processes all included frames. |
August 13, 2016, 01:59 (GMT) |
Cycles: Implement multi-frame buffer support and loading in standalone mode

This commit adds an option to the BufferParams that specifies how many frames are stored in there. The frames share all other parameters, such as size and passes. Frames are not stored in order - instead, the first frame is the primary frame, so that all code that uses the RenderBuffers still works as expected, but code parts that can use the additional frames may do so.

The Standalone Denoising mode now comes with an option to specify the frame range that will be used for denoising. When doing so, the input filename isn't an actual file, but has to contain a part of the form "%Xd" that specifies how the frame file names are formatted, where X is the length to which frames are zero-padded. That part will be replaced by the padded frame number before loading.

So far, no code actually uses the additional frames; that will come in the next commits. |
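The "%Xd" filename substitution described above could look roughly like this in C++. This is a sketch under assumptions, not the code from the commit: expand_frame_pattern is a hypothetical name, only the first "%" in the pattern is considered, and frame numbers are assumed to be non-negative.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Replace the first "%Xd" placeholder in pattern with the frame number,
// zero-padded to X digits. Patterns without a valid placeholder are
// returned unchanged.
inline std::string expand_frame_pattern(const std::string& pattern, int frame)
{
    assert(frame >= 0);
    const std::size_t start = pattern.find('%');
    if (start == std::string::npos) {
        return pattern;
    }

    // Parse the padding width X that follows the '%'.
    std::size_t pos = start + 1;
    std::size_t width = 0;
    while (pos < pattern.size() && std::isdigit(static_cast<unsigned char>(pattern[pos]))) {
        width = width * 10 + static_cast<std::size_t>(pattern[pos] - '0');
        ++pos;
    }
    if (pos >= pattern.size() || pattern[pos] != 'd') {
        return pattern;
    }

    // Left-pad the frame number with zeros up to the requested width.
    std::string number = std::to_string(frame);
    if (number.size() < width) {
        number.insert(0, width - number.size(), '0');
    }
    return pattern.substr(0, start) + number + pattern.substr(pos + 1);
}
```

For example, the pattern "render_%4d.exr" with frame 17 expands to "render_0017.exr". The file names here are made up for illustration.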
August 13, 2016, 01:52 (GMT) |
Cycles: Implement half float file output and fix flipped standalone-denoised images

Since the tonemapping task already supports both Byte and Half output, the only needed change is to the DisplayBuffer itself. |
August 13, 2016, 01:50 (GMT) |
Cycles Standalone: Implement the half window option |
|