August 21, 2016, 04:04 (GMT) |
Cycles: Write optional debug info for CUDA shadow prefiltering |
August 21, 2016, 04:04 (GMT) |
Cycles: Don't denoise the current tile if the user cancelled the render |
August 21, 2016, 04:04 (GMT) |
Cycles Denoising: Tweak shadow filtering |
August 21, 2016, 04:04 (GMT) |
Cycles: Revert to 6 bias-variance samples The CUDA redesign commit removed the sample at h=2, but I found that this actually makes results worse. Therefore, it's now added back. |
August 21, 2016, 03:51 (GMT) |
Cycles: Redesign CUDA kernels to increase denoising performance

This commit contains essentially a complete overhaul of the CUDA denoising kernels.

One of the main changes is splitting up the huge estimate_params kernel into multiple smaller ones:
- One kernel calculates the reduced feature space transform.
- One kernel estimates the feature bandwidths.
- One kernel estimates bias and variance for a given global bandwidth. This kernel is executed multiple times for different global bandwidths.
- One kernel calculates the optimal global bandwidth.

This improves UI responsiveness since the individual kernel launches are shorter. Also, smaller kernels are always a good thing on GPUs - from register allocation to warp divergence.

The next major improvement concerns the transform - before this commit, transform loads from global memory were the main bottleneck. First of all, it's now stored in a SoA layout instead of AoS, which makes all transform loads coalesced. Furthermore, the transform pointer is declared as "float const* __restrict__" instead of float*, which allows NVCC to cache the transform reads. Since only the first kernel writes the transforms, this increases speed again.

The third major change is that the feature vector, which is used in every per-pixel loop, is now stored in shared memory. Since the feature vector is involved in a lot of operations, this improves performance again. On the other hand, shared memory is rather limited on Kepler and older, so even the 11 floats per thread are already a lot. With the default "16KB shared - 48KB L1 Cache" split on a GTX780, occupancy is only 12.5% - way too low. With "48KB shared - 16KB L1 Cache", occupancy is back up at 50%, but of course there are more cache misses - in the end, though, the benefits of having the feature vector local make up for that. I expect the performance boost to be even higher on Maxwell and Pascal, since these have much larger shared memory and L1. |
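The AoS-to-SoA change described in this commit can be illustrated with a minimal host-side C++ sketch. This is not the actual Cycles code: the names `aos_index`, `soa_index`, `aos_to_soa`, `sum_element` and the size `kTransformSize` are hypothetical and chosen only for illustration. `__restrict__` is used here as the GCC/Clang compiler extension; in the commit it is applied to a CUDA kernel pointer.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical number of transform floats per pixel (illustration only).
constexpr std::size_t kTransformSize = 4;

// AoS: all floats belonging to one pixel are adjacent in memory.
inline std::size_t aos_index(std::size_t pixel, std::size_t element)
{
    return pixel * kTransformSize + element;
}

// SoA: the same element of neighbouring pixels is adjacent in memory, so
// consecutive threads (one per pixel) would touch consecutive addresses.
inline std::size_t soa_index(std::size_t pixel, std::size_t element,
                             std::size_t num_pixels)
{
    return element * num_pixels + pixel;
}

// Re-pack an AoS buffer into SoA order.
std::vector<float> aos_to_soa(const std::vector<float>& aos,
                              std::size_t num_pixels)
{
    std::vector<float> soa(aos.size());
    for (std::size_t p = 0; p < num_pixels; ++p) {
        for (std::size_t e = 0; e < kTransformSize; ++e) {
            soa[soa_index(p, e, num_pixels)] = aos[aos_index(p, e)];
        }
    }
    return soa;
}

// Read-only access through a const, restrict-qualified pointer: the caller
// promises that nothing else aliases the data while this function runs.
float sum_element(const float* __restrict__ transform, std::size_t element,
                  std::size_t num_pixels)
{
    float sum = 0.0f;
    for (std::size_t p = 0; p < num_pixels; ++p) {
        sum += transform[soa_index(p, element, num_pixels)];
    }
    return sum;
}
```

In the SoA layout, stepping from one pixel to the next moves the address by one float, whereas in the AoS layout it moves by `kTransformSize` floats. That difference is what makes loads from neighbouring threads coalesce.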
August 21, 2016, 03:51 (GMT) |
Cycles: Fix wrong stride in buffer accesses when the tile size plus overscan wasn't a multiple of 4 |
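The commit message does not say how the stride was fixed, so the following is only a sketch of one common way this kind of bug arises. It assumes rows are padded to a multiple of 4 floats and that overscan is added on both sides of the tile. The helpers `align_up4`, `buffer_stride` and `pixel_offset` are hypothetical names, not Cycles functions.

```cpp
// Round n up to the next multiple of 4 (n is assumed to be non-negative).
inline int align_up4(int n)
{
    return (n + 3) & ~3;
}

// Row stride of a padded buffer: tile width plus overscan on both sides,
// rounded up so that every row starts at a multiple of 4 floats.
inline int buffer_stride(int tile_width, int overscan)
{
    return align_up4(tile_width + 2 * overscan);
}

// Offset of pixel (x, y). The padded stride must be used here; indexing
// with the unpadded width would read from the wrong row whenever the
// width is not already a multiple of 4.
inline int pixel_offset(int x, int y, int stride)
{
    return y * stride + x;
}
```

For example, a 64-pixel tile with 3 pixels of overscan is 70 pixels wide but would need a stride of 72 under these assumptions.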
August 21, 2016, 03:50 (GMT) |
Cycles: Fix compilation with enabled filter debug output |
August 21, 2016, 03:38 (GMT) |
Merge remote-tracking branch 'origin/master' into soc-2016-cycles_denoising This was an extremely hacky merge with a lot of rebasing and git tricks involved; I hope it works as it's supposed to. |
August 13, 2016, 03:11 (GMT) |
Cycles: Fix memory leak in the denoiser |
August 13, 2016, 02:58 (GMT) |
Cycles: Implement the multi-frame denoising kernel This commit changes the denoising kernel to actually use the additional frames. The required changes are surprisingly small - one additional feature contains the frame to which the pixel belongs, and the per-pixel loop now iterates over frames first. |
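A rough sketch of the loop structure described here, with frames as the outermost loop and the frame index carried as one extra feature. The `Feature` struct, its three fields and `gather_features` are hypothetical and heavily simplified; a real denoiser would use many more features.

```cpp
#include <vector>

// Hypothetical, reduced feature vector: pixel offset plus the frame index.
struct Feature {
    float dx;
    float dy;
    float frame;  // the additional feature: which frame the pixel belongs to
};

// Visit every pixel of a (2 * half_window + 1)^2 window in each frame.
// The loop over frames is the outermost one.
std::vector<Feature> gather_features(int num_frames, int half_window)
{
    std::vector<Feature> features;
    for (int frame = 0; frame < num_frames; ++frame) {
        for (int dy = -half_window; dy <= half_window; ++dy) {
            for (int dx = -half_window; dx <= half_window; ++dx) {
                features.push_back({static_cast<float>(dx),
                                    static_cast<float>(dy),
                                    static_cast<float>(frame)});
            }
        }
    }
    return features;
}
```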
August 13, 2016, 02:06 (GMT) |
Cycles: Implement multi-frame denoising buffers This commit changes the prefiltering code so that it processes all included frames. |
August 13, 2016, 01:59 (GMT) |
Cycles: Implement multi-frame buffer support and loading in standalone mode

This commit adds an option to the BufferParams that specifies how many frames are stored in there. The frames share all other parameters, such as size and passes. Frames are not stored in order - instead, the first frame is the primary frame, so that all code that uses the RenderBuffers still works as expected, but code parts that can use the additional frames may do so.

The Standalone Denoising mode now comes with an option to specify the frame range that will be used for denoising. When doing so, the input filename isn't an actual file, but has to contain a part of the form "%Xd" that specifies how the frame file names are formatted, where X is the length to which frames are zero-padded. That part will be replaced by the padded frame number before loading.

So far, no code actually uses the additional frames; that will come in the next commits. |
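The "%Xd" substitution can be sketched as follows. This is an illustrative stand-alone helper, not the code from the commit: the function name `expand_frame_pattern` and its error handling are assumptions, and frame numbers are assumed to be non-negative.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>

// Replace the first "%Xd" in `pattern` (X = one or more digits giving the
// zero-padded width) with the padded frame number.
// Assumes frame >= 0; throws if no "%Xd" part is found.
std::string expand_frame_pattern(const std::string& pattern, int frame)
{
    for (std::size_t start = pattern.find('%'); start != std::string::npos;
         start = pattern.find('%', start + 1)) {
        std::size_t pos = start + 1;
        std::size_t width = 0;
        // Parse the decimal width that follows the '%'.
        while (pos < pattern.size() &&
               std::isdigit(static_cast<unsigned char>(pattern[pos]))) {
            width = width * 10 + static_cast<std::size_t>(pattern[pos] - '0');
            ++pos;
        }
        // Require at least one digit followed by 'd'; otherwise keep looking.
        if (pos == start + 1 || pos >= pattern.size() || pattern[pos] != 'd') {
            continue;
        }
        std::string number = std::to_string(frame);
        if (number.size() < width) {
            number.insert(0, width - number.size(), '0');
        }
        return pattern.substr(0, start) + number + pattern.substr(pos + 1);
    }
    throw std::invalid_argument("pattern contains no %Xd part");
}
```

With this sketch, a hypothetical pattern such as `render_%4d.exr` and frame 17 would expand to `render_0017.exr`.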
August 13, 2016, 01:52 (GMT) |
Cycles: Implement half float file output and fix flipped standalone-denoised images Since the tonemapping task already supports both Byte and Half output, the only needed change is to the DisplayBuffer itself. |
August 13, 2016, 01:50 (GMT) |
Cycles Standalone: Implement the half window option |
August 9, 2016, 01:50 (GMT) |
Cycles: Move denoising utility functions into a separate file |
August 9, 2016, 01:50 (GMT) |
Cycles: Add a few SSE utilities |
August 9, 2016, 01:50 (GMT) |
Cycles: Implement SSE3-optimized NLM prefiltering kernel |
August 9, 2016, 01:50 (GMT) |
Cycles: Move prefiltering functions into a separate file |
August 9, 2016, 01:50 (GMT) |
Cycles: Fix preprocessor directives around SSE3 replacement functions The code is supposed to implement replacements for a few SSE4.1-specific functions so that they can be used with SSE3 as well. Therefore, it was enabled when __KERNEL_SSE3__ was set, but __KERNEL_SSE4__ wasn't. However, __KERNEL_SSE4__ is never set anywhere - the correct one is __KERNEL_SSE41__. Because of that, the replacements were also enabled for SSE4.1 and better (AVX), where they're not needed and only slow things down. |
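The effect of the wrong macro name can be reproduced in isolation. The sketch below simulates an SSE4.1 build by defining the two macros named in the commit message by hand. The exact form of the original guard is an assumption, and the constants `kOldGuardEnablesFallbacks` and `kFixedGuardEnablesFallbacks` are hypothetical.

```cpp
// Simulate an SSE4.1 build configuration (normally set by the build system).
#ifndef __KERNEL_SSE3__
#  define __KERNEL_SSE3__
#endif
#ifndef __KERNEL_SSE41__
#  define __KERNEL_SSE41__
#endif

// Old guard: tests __KERNEL_SSE4__, which is never defined, so the SSE3
// fallbacks are enabled even though SSE4.1 is available.
#if defined(__KERNEL_SSE3__) && !defined(__KERNEL_SSE4__)
constexpr bool kOldGuardEnablesFallbacks = true;
#else
constexpr bool kOldGuardEnablesFallbacks = false;
#endif

// Fixed guard: tests __KERNEL_SSE41__, so the fallbacks are only enabled
// for builds that have SSE3 but not SSE4.1.
#if defined(__KERNEL_SSE3__) && !defined(__KERNEL_SSE41__)
constexpr bool kFixedGuardEnablesFallbacks = true;
#else
constexpr bool kFixedGuardEnablesFallbacks = false;
#endif

static_assert(kOldGuardEnablesFallbacks, "old guard wrongly enables fallbacks");
static_assert(!kFixedGuardEnablesFallbacks, "fixed guard disables fallbacks");
```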
August 9, 2016, 01:50 (GMT) |
Cycles: Implement SSE3-optimized denoising kernel |
|