Blender Git Log

Blender Git "temp-cycles-denoising" branch commits.

August 21, 2016, 04:04 (GMT)
Cycles: Write optional debug info for CUDA shadow prefiltering
August 21, 2016, 04:04 (GMT)
Cycles: Don't denoise the current tile if the user cancelled the render
August 21, 2016, 04:04 (GMT)
Cycles Denoising: Tweak shadow filtering
August 21, 2016, 04:04 (GMT)
Cycles: Revert to 6 bias-variance samples

The CUDA redesign commit removed the sample at h=2, but I found that this actually makes results worse.
Therefore, it's now added back.
August 21, 2016, 03:51 (GMT)
Cycles: Redesign CUDA kernels to increase denoising performance

This commit contains essentially a complete overhaul of the CUDA denoising kernels.

One of the main changes is splitting up the huge estimate_params kernel into multiple smaller ones:
- One kernel calculates the reduced feature space transform.
- One kernel estimates the feature bandwidths.
- One kernel estimates bias and variance for a given global bandwidth. This kernel is executed multiple times for different global bandwidths.
- One kernel calculates the optimal global bandwidth.

This improves UI responsiveness since the individual kernel launches are shorter.
Also, smaller kernels are always a good thing on GPUs - from register allocation to warp divergence.

The next major improvement concerns the transform - before this commit, transform loads from global memory were the main bottleneck.
First of all, it's now stored in a SoA layout instead of AoS, which makes all transform loads coalesced.
Furthermore, the transform pointer is declared as "float const* __restrict__" instead of float*, which allows NVCC to cache the transform reads. Since only the first kernel writes the transforms, this increases speed again.
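
As a rough illustration of the layout change, the two indexing schemes can be written as plain index arithmetic. The function names and the sizes used with them are hypothetical; the actual transform dimensions aren't given in this commit message.

```cpp
#include <cassert>
#include <cstddef>

// AoS: all transform entries of one pixel sit next to each other, so
// neighbouring pixels (threads) read addresses entries_per_pixel apart.
inline std::size_t aos_index(std::size_t pixel, std::size_t entry,
                             std::size_t entries_per_pixel)
{
    assert(entry < entries_per_pixel);
    return pixel * entries_per_pixel + entry;
}

// SoA: entry k of all pixels sits next to each other, so neighbouring
// pixels (threads) read consecutive addresses, which can be coalesced.
inline std::size_t soa_index(std::size_t pixel, std::size_t entry,
                             std::size_t num_pixels)
{
    assert(pixel < num_pixels);
    return entry * num_pixels + pixel;
}
```

With 8 entries per pixel, for example, two adjacent pixels are 8 floats apart in the AoS layout but only 1 float apart in the SoA layout.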

The third major change is that the feature vector, which is used in every per-pixel loop, is now stored in shared memory.
Since the feature vector is involved in a lot of operations, this improves performance again.
On the other hand, shared memory is rather limited on Kepler and older, so even the 11 floats per thread are already a lot.
With the default "16KB shared - 48KB L1 Cache" split on a GTX780, occupancy is only 12.5% - way too low.
With "48KB shared - 16KB L1 Cache", occupancy is back up at 50%, but of course there are more cache misses - in the end, though, the benefits of having the feature vector local make up for that.
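
The occupancy figures above can be reproduced with simple arithmetic. This is only a sketch: it assumes a block size of 256 threads (the commit message doesn't state the block size), uses the Kepler limit of 2048 resident threads per multiprocessor, and ignores registers and every other limit. The function name is hypothetical.

```cpp
#include <cassert>

// Occupancy in percent when shared memory is the only limiting factor.
inline double shared_limited_occupancy_percent(int shared_bytes_per_sm,
                                               int threads_per_block,
                                               int bytes_per_thread,
                                               int max_threads_per_sm)
{
    assert(shared_bytes_per_sm > 0 && threads_per_block > 0);
    assert(bytes_per_thread > 0 && max_threads_per_sm > 0);
    // Shared memory is allocated per block, so only whole blocks fit.
    const int bytes_per_block = threads_per_block * bytes_per_thread;
    const int resident_blocks = shared_bytes_per_sm / bytes_per_block;
    int resident_threads = resident_blocks * threads_per_block;
    if (resident_threads > max_threads_per_sm) {
        resident_threads = max_threads_per_sm;
    }
    return 100.0 * resident_threads / max_threads_per_sm;
}
```

Under these assumptions, 11 floats (44 bytes) per thread give 11264 bytes per block: one block fits into 16KB (256 of 2048 threads, 12.5%), and four blocks fit into 48KB (1024 of 2048 threads, 50%).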

I expect the performance boost to be even higher on Maxwell and Pascal, since these have much larger shared memory and L1.
August 21, 2016, 03:51 (GMT)
Cycles: Fix wrong stride in buffer accesses when the tile size plus overscan wasn't a multiple of 4
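
The commit message doesn't show the fix itself. One common way to handle such a case, sketched here with hypothetical helper names, is to round the row width up to the next multiple of 4 and use that padded value as the stride everywhere. That the overscan is applied on both sides of the tile is an assumption.

```cpp
#include <cassert>

// Round value up to the next multiple of alignment.
inline int align_up(int value, int alignment)
{
    assert(value >= 0 && alignment > 0);
    return ((value + alignment - 1) / alignment) * alignment;
}

// Hypothetical helper: row stride for a tile with overscan on both sides
// (an assumption), padded to a multiple of 4 elements.
inline int padded_row_stride(int tile_width, int overscan)
{
    return align_up(tile_width + 2 * overscan, 4);
}
```

For a 64 pixel wide tile with an overscan of 3, the row is 70 pixels wide, but the stride used for addressing would be 72.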
August 21, 2016, 03:50 (GMT)
Cycles: Fix compilation with enabled filter debug output
August 21, 2016, 03:38 (GMT)
Merge remote-tracking branch 'origin/master' into soc-2016-cycles_denoising

This was an extremely hacky merge with a lot of rebasing and git tricks involved; I hope it works as it's supposed to.
August 13, 2016, 03:11 (GMT)
Cycles: Fix memory leak in the denoiser
August 13, 2016, 02:58 (GMT)
Cycles: Implement the multi-frame denoising kernel

This commit changes the denoising kernel to actually use the additional frames.
The required changes are surprisingly small - one additional feature contains
the frame to which the pixel belongs, and the per-pixel loop now iterates over frames first.
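
The loop structure described here might look roughly like the following host-side sketch. The struct and function names are hypothetical, and the real kernel accumulates feature statistics rather than collecting samples into a vector; the sketch also skips clamping the window to the image bounds.

```cpp
#include <cassert>
#include <vector>

// One neighbourhood sample; the frame index acts as the extra feature.
struct WindowSample {
    int frame;
    int x;
    int y;
};

// Visit every pixel of the filter window in every frame, frames outermost.
inline std::vector<WindowSample> window_samples(int num_frames, int cx, int cy,
                                                int half_window)
{
    assert(num_frames > 0 && half_window >= 0);
    std::vector<WindowSample> samples;
    for (int frame = 0; frame < num_frames; ++frame) {
        for (int y = cy - half_window; y <= cy + half_window; ++y) {
            for (int x = cx - half_window; x <= cx + half_window; ++x) {
                samples.push_back({frame, x, y});
            }
        }
    }
    return samples;
}
```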
August 13, 2016, 02:06 (GMT)
Cycles: Implement multi-frame denoising buffers

This commit changes the prefiltering code so that it processes all included frames.
August 13, 2016, 01:59 (GMT)
Cycles: Implement multi-frame buffer support and loading in standalone mode

This commit adds an option to the BufferParams that specifies how many frames are stored in there.
The frames share all other parameters, such as size and passes.
Frames are not stored in order - instead, the first frame is the primary frame, so that all code that uses
the RenderBuffers still works as expected, but code parts that can use the additional frames may do so.

The Standalone Denoising mode now comes with an option to specify the frame range that will be used for denoising.
When doing so, the input filename isn't an actual file, but has to contain a part of the form "%Xd" that specifies how the frame file names are formatted, where X is the length to which frames are zero-padded. That part will be replaced by the padded frame number before loading.
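
The filename substitution described above could be sketched as follows. expand_frame_pattern is a hypothetical helper written for illustration, not the function used in Cycles; it replaces the first "%Xd" part it finds and leaves names without such a part unchanged.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Replace the first "%Xd" (X = decimal pad width) with the zero-padded frame.
inline std::string expand_frame_pattern(const std::string &pattern, int frame)
{
    assert(frame >= 0);
    std::size_t start = pattern.find('%');
    while (start != std::string::npos) {
        std::size_t pos = start + 1;
        std::size_t width = 0;
        // Parse the digits that give the padded length.
        while (pos < pattern.size() &&
               std::isdigit(static_cast<unsigned char>(pattern[pos]))) {
            width = width * 10 + static_cast<std::size_t>(pattern[pos] - '0');
            ++pos;
        }
        // Accept only "%<digits>d"; anything else is left untouched.
        if (pos > start + 1 && pos < pattern.size() && pattern[pos] == 'd') {
            std::string number = std::to_string(frame);
            if (number.size() < width) {
                number.insert(0, width - number.size(), '0');
            }
            return pattern.substr(0, start) + number + pattern.substr(pos + 1);
        }
        start = pattern.find('%', start + 1);
    }
    return pattern;
}
```

With the made-up input name "render_%4d.exr" and frame 17, this yields "render_0017.exr". Frame numbers longer than the pad width are kept in full.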

So far, no code uses the additional frames; that will come in the next commits.
August 13, 2016, 01:52 (GMT)
Cycles: Implement half float file output and fix flipped standalone-denoised images

Since the tonemapping task already supports both Byte and Half output,
the only needed change is to the DisplayBuffer itself.
August 13, 2016, 01:50 (GMT)
Cycles Standalone: Implement the half window option
August 9, 2016, 01:50 (GMT)
Cycles: Move denoising utility functions into a separate file
August 9, 2016, 01:50 (GMT)
Cycles: Add a few SSE utilities
August 9, 2016, 01:50 (GMT)
Cycles: Implement SSE3-optimized NLM prefiltering kernel
August 9, 2016, 01:50 (GMT)
Cycles: Move prefiltering functions into a separate file
August 9, 2016, 01:50 (GMT)
Cycles: Fix preprocessor directives around SSE3 replacement functions

The code is supposed to implement replacements for a few SSE4.1-specific functions so that they can be used with SSE3 as well.
Therefore, it was enabled when __KERNEL_SSE3__ was set, but __KERNEL_SSE4__ wasn't.

However, __KERNEL_SSE4__ is never set anywhere - the correct one is __KERNEL_SSE41__.
Because of that, the replacements were enabled for SSE4.1 and better (AVX) as well, where they're not needed and only slow things down.
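
The effect of the wrong macro name can be shown in isolation. The sketch below simulates an SSE4.1 build by defining both macros itself (in Cycles they come from the build configuration), and the two constants are hypothetical names that only record which guard is active.

```cpp
// Simulate an SSE4.1 build: both macros are set, as described above.
#define __KERNEL_SSE3__
#define __KERNEL_SSE41__

// Old guard: __KERNEL_SSE4__ is never defined anywhere, so the condition
// holds whenever SSE3 is enabled, including SSE4.1 and AVX builds.
#if defined(__KERNEL_SSE3__) && !defined(__KERNEL_SSE4__)
constexpr bool old_guard_enables_replacements = true;
#else
constexpr bool old_guard_enables_replacements = false;
#endif

// Fixed guard: checks the macro that is actually set for SSE4.1.
#if defined(__KERNEL_SSE3__) && !defined(__KERNEL_SSE41__)
constexpr bool fixed_guard_enables_replacements = true;
#else
constexpr bool fixed_guard_enables_replacements = false;
#endif

static_assert(old_guard_enables_replacements,
              "old guard wrongly enables the SSE3 replacements");
static_assert(!fixed_guard_enables_replacements,
              "fixed guard leaves the replacements off for SSE4.1");
```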
August 9, 2016, 01:50 (GMT)
Cycles: Implement SSE3-optimized denoising kernel