Blender Git Log

Blender Git "temp-cycles-denoising" branch commits.

Revision 559404d by Lukas Stockner (soc-2016-cycles_denoising, temp-cycles-denoising)

August 21, 2016, 04:04 (GMT)

Cycles: Write optional debug info for CUDA shadow prefiltering

Revision 6a67f80 by Lukas Stockner (soc-2016-cycles_denoising, temp-cycles-denoising)

August 21, 2016, 04:04 (GMT)

Cycles: Don't denoise the current tile if the user cancelled the render

Revision 8771846 by Lukas Stockner (soc-2016-cycles_denoising, temp-cycles-denoising)

August 21, 2016, 04:04 (GMT)

Cycles Denoising: Tweak shadow filtering

Revision e1a2787 by Lukas Stockner (soc-2016-cycles_denoising, temp-cycles-denoising)

August 21, 2016, 04:04 (GMT)

Cycles: Revert to 6 bias-variance samples

The CUDA redesign commit removed the sample at h=2, but I found that this actually makes results worse.
Therefore, it's now added back.

Revision 1db96fa by Lukas Stockner (soc-2016-cycles_denoising, temp-cycles-denoising)

August 21, 2016, 03:51 (GMT)

Cycles: Redesign CUDA kernels to increase denoising performance

This commit contains essentially a complete overhaul of the CUDA denoising kernels.

One of the main changes is splitting up the huge estimate_params kernel into multiple smaller ones:
- One kernel calculates the reduced feature space transform.
- One kernel estimates the feature bandwidths.
- One kernel estimates bias and variance for a given global bandwidth. This kernel is executed multiple times for different global bandwidths.
- One kernel calculates the optimal global bandwidth.

This improves UI responsiveness since the individual kernel launches are shorter.
Also, smaller kernels generally behave better on GPUs, with benefits ranging from register allocation to warp divergence.

The next major improvement concerns the transform - before this commit, transform loads from global memory were the main bottleneck.
First of all, it's now stored in a SoA layout instead of AoS, which makes all transform loads coalesced.
Furthermore, the transform pointer is declared as "float const* __restrict__" instead of float*, which allows NVCC to cache the transform reads. Since only the first kernel writes the transforms, this increases speed again.
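
The layout change can be illustrated with a minimal C++ sketch. The names (TRANSFORM_SIZE, aos_index, soa_index, sum_transform, SKETCH_RESTRICT) and the element count are hypothetical and not taken from the Cycles source. The point is only that with SoA, neighbouring pixels (and therefore neighbouring CUDA threads) read neighbouring addresses, which is what makes the loads coalesce.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical number of floats in one pixel's transform (not the real Cycles value).
constexpr std::size_t TRANSFORM_SIZE = 4;

// __restrict__ is a compiler extension (GCC, Clang, NVCC); expand to nothing elsewhere.
#if defined(__GNUC__) || defined(__clang__) || defined(__CUDACC__)
#  define SKETCH_RESTRICT __restrict__
#else
#  define SKETCH_RESTRICT
#endif

// AoS: the elements of one pixel's transform are contiguous,
// so neighbouring pixels are TRANSFORM_SIZE floats apart.
inline std::size_t aos_index(std::size_t pixel, std::size_t element)
{
    assert(element < TRANSFORM_SIZE);
    return pixel * TRANSFORM_SIZE + element;
}

// SoA: a given element of every pixel is contiguous,
// so neighbouring pixels are exactly one float apart.
inline std::size_t soa_index(std::size_t pixel, std::size_t element, std::size_t num_pixels)
{
    assert(pixel < num_pixels && element < TRANSFORM_SIZE);
    return element * num_pixels + pixel;
}

// Read-only access through a const, restrict-qualified pointer: the qualifiers
// tell the compiler the data is not modified through another alias.
inline float sum_transform(const float *SKETCH_RESTRICT transform,
                           std::size_t pixel, std::size_t num_pixels)
{
    float sum = 0.0f;
    for (std::size_t e = 0; e < TRANSFORM_SIZE; ++e) {
        sum += transform[soa_index(pixel, e, num_pixels)];
    }
    return sum;
}
```

For a fixed element, soa_index(p + 1, e, n) - soa_index(p, e, n) is 1, while the AoS stride between neighbouring pixels is TRANSFORM_SIZE.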

The third major change is that the feature vector, which is used in every per-pixel loop, is now stored in shared memory.
Since the feature vector is involved in a lot of operations, this improves performance again.
On the other hand, shared memory is rather limited on Kepler and older, so even the 11 floats per thread are already a lot.
With the default "16KB shared - 48KB L1 Cache" split on a GTX780, occupancy is only 12.5% - way too low.
With "48KB shared - 16KB L1 Cache", occupancy is back up at 50%, but of course there are more cache misses - in the end, though, the benefits of having the feature vector local make up for that.
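
The occupancy figures can be reproduced with a short calculation. The sketch below is an approximation: it considers only shared memory and the Kepler per-SM limits (2048 resident threads, 16 resident blocks) and ignores registers. The block size of 256 threads is an assumption, since the commit message does not state it, and the function name is hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Per-SM limits for Kepler (compute capability 3.x).
constexpr int MAX_THREADS_PER_SM = 2048;
constexpr int MAX_BLOCKS_PER_SM = 16;

// Occupancy when only shared memory and the thread/block limits are considered.
// Register pressure is deliberately ignored in this sketch.
inline double shared_memory_occupancy(int shared_bytes_per_sm,
                                      int threads_per_block,
                                      int floats_per_thread)
{
    assert(threads_per_block > 0 && floats_per_thread > 0);
    const int bytes_per_block =
        threads_per_block * floats_per_thread * static_cast<int>(sizeof(float));
    const int blocks_by_shared = shared_bytes_per_sm / bytes_per_block;
    const int blocks_by_threads = MAX_THREADS_PER_SM / threads_per_block;
    const int blocks = std::min({blocks_by_shared, blocks_by_threads, MAX_BLOCKS_PER_SM});
    return static_cast<double>(blocks * threads_per_block) / MAX_THREADS_PER_SM;
}
```

With 11 floats per thread and the assumed 256 threads per block, one block needs 11264 bytes. A 16KB split fits one block (256 of 2048 threads, 12.5%), and a 48KB split fits four (1024 of 2048 threads, 50%).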

I expect the performance boost to be even higher on Maxwell and Pascal, since these have much larger shared memory and L1.

Revision 25df3ca by Lukas Stockner (soc-2016-cycles_denoising, temp-cycles-denoising)

August 21, 2016, 03:51 (GMT)

Cycles: Fix wrong stride in buffer accesses when the tile size plus overscan wasn't a multiple of 4

Revision 4ab88b4 by Lukas Stockner (soc-2016-cycles_denoising, temp-cycles-denoising)

August 21, 2016, 03:50 (GMT)

Cycles: Fix compilation with enabled filter debug output

Revision c5e9fab by Lukas Stockner (soc-2016-cycles_denoising, temp-cycles-denoising)

August 21, 2016, 03:38 (GMT)

Merge remote-tracking branch 'origin/master' into soc-2016-cycles_denoising

This was an extremely hacky merge with a lot of rebasing and git tricks involved; I hope it works as it's supposed to.

Revision 98dfe6f by Lukas Stockner (soc-2016-cycles_denoising, temp-cycles-denoising)

August 13, 2016, 03:11 (GMT)

Cycles: Fix memory leak in the denoiser

Revision 2af9026 by Lukas Stockner (soc-2016-cycles_denoising, temp-cycles-denoising)

August 13, 2016, 02:58 (GMT)

Cycles: Implement the multi-frame denoising kernel

This commit changes the denoising kernel to actually use the additional frames.
The required changes are surprisingly small - one additional feature contains
the frame to which the pixel belongs, and the per-pixel loop now iterates over frames first.
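
A rough sketch of what "iterating over frames first" might look like. This is not the kernel code: WindowSample and gather_window are hypothetical names, and the clamped square window is an assumption made for illustration.

```cpp
#include <algorithm>
#include <vector>

// One neighbour visited by the per-pixel loop. The frame index is carried
// along so that it can serve as an additional feature.
struct WindowSample {
    int frame;
    int x;
    int y;
};

// Hypothetical helper: collect all neighbours of (cx, cy) within half_window,
// clamped to the image, with the frame loop as the outermost loop.
inline std::vector<WindowSample> gather_window(int num_frames, int cx, int cy,
                                               int half_window, int width, int height)
{
    std::vector<WindowSample> samples;
    const int x0 = std::max(cx - half_window, 0);
    const int x1 = std::min(cx + half_window, width - 1);
    const int y0 = std::max(cy - half_window, 0);
    const int y1 = std::min(cy + half_window, height - 1);

    for (int frame = 0; frame < num_frames; ++frame) {
        for (int y = y0; y <= y1; ++y) {
            for (int x = x0; x <= x1; ++x) {
                samples.push_back({frame, x, y});
            }
        }
    }
    return samples;
}
```

Compared with a single-frame loop, only the outer loop and the extra frame member are new, which matches the commit's remark that the required changes are small.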

Revision e020820 by Lukas Stockner (soc-2016-cycles_denoising, temp-cycles-denoising)

August 13, 2016, 02:06 (GMT)

Cycles: Implement multi-frame denoising buffers

This commit changes the prefiltering code so that it processes all included frames.

Revision 1c675f1 by Lukas Stockner (soc-2016-cycles_denoising, temp-cycles-denoising)

August 13, 2016, 01:59 (GMT)

Cycles: Implement multi-frame buffer support and loading in standalone mode

This commit adds an option to the BufferParams that specifies how many frames are stored in there.
The frames share all other parameters, such as size and passes.
Frames are not stored in order - instead, the first frame is the primary frame, so that all code that uses
the RenderBuffers still works as expected, but code parts that can use the additional frames may do so.

The Standalone Denoising mode now comes with an option to specify the frame range that will be used for denoising.
When doing so, the input filename isn't an actual file, but has to contain a part of the form "%Xd" that specifies how the frame file names are formatted, where X is the length to which frames are zero-padded. That part will be replaced by the padded frame number before loading.
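
The placeholder substitution could be sketched as follows. expand_frame_pattern is a hypothetical helper and not the function used in Cycles Standalone; it assumes a single "%Xd" placeholder and non-negative frame numbers.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>

// Replace the first "%Xd" placeholder in `pattern` with `frame`,
// zero-padded to X digits. Throws if no well-formed placeholder is found.
inline std::string expand_frame_pattern(const std::string &pattern, int frame)
{
    if (frame < 0) {
        throw std::invalid_argument("frame must be non-negative");
    }
    const std::size_t start = pattern.find('%');
    if (start == std::string::npos) {
        throw std::invalid_argument("pattern contains no % placeholder");
    }

    // Parse the padding width X that follows the '%'.
    std::size_t pos = start + 1;
    std::size_t width = 0;
    while (pos < pattern.size() && std::isdigit(static_cast<unsigned char>(pattern[pos]))) {
        width = width * 10 + static_cast<std::size_t>(pattern[pos] - '0');
        ++pos;
    }
    if (pos >= pattern.size() || pattern[pos] != 'd') {
        throw std::invalid_argument("placeholder must have the form %Xd");
    }

    // Zero-pad the frame number on the left up to the requested width.
    std::string number = std::to_string(frame);
    if (number.size() < width) {
        number.insert(0, width - number.size(), '0');
    }
    return pattern.substr(0, start) + number + pattern.substr(pos + 1);
}
```

For example, expand_frame_pattern("render_%4d.exr", 7) yields "render_0007.exr"; frame numbers longer than the padding width are left unpadded.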

So far, no code actually uses the additional frames; that will come in the next commits.

Revision bfffcb5 by Lukas Stockner (soc-2016-cycles_denoising, temp-cycles-denoising)

August 13, 2016, 01:52 (GMT)

Cycles: Implement half float file output and fix flipped standalone-denoised images

Since the tonemapping task already supports both Byte and Half output,
the only needed change is to the DisplayBuffer itself.

Revision 343fd70 by Lukas Stockner (soc-2016-cycles_denoising, temp-cycles-denoising)

August 13, 2016, 01:50 (GMT)

Cycles Standalone: Implement the half window option

Revision 6792499 by Lukas Stockner (soc-2016-cycles_denoising, temp-cycles-denoising)

August 9, 2016, 01:50 (GMT)

Cycles: Move denoising utility functions into a separate file

Revision 741a245 by Lukas Stockner (soc-2016-cycles_denoising, temp-cycles-denoising)

August 9, 2016, 01:50 (GMT)

Cycles: Add a few SSE utilities

Revision 95fa483 by Lukas Stockner (soc-2016-cycles_denoising, temp-cycles-denoising)

August 9, 2016, 01:50 (GMT)

Cycles: Implement SSE3-optimized NLM prefiltering kernel

Revision b7dc25c by Lukas Stockner (soc-2016-cycles_denoising, temp-cycles-denoising)

August 9, 2016, 01:50 (GMT)

Cycles: Move prefiltering functions into a separate file

Revision cf017e8 by Lukas Stockner (soc-2016-cycles_denoising, temp-cycles-denoising)

August 9, 2016, 01:50 (GMT)

Cycles: Fix preprocessor directives around SSE3 replacement functions

The code is supposed to implement replacements for a few SSE4.1-specific functions so that they can be used with SSE3 as well.
Therefore, it was enabled when __KERNEL_SSE3__ was set, but __KERNEL_SSE4__ wasn't.

However, __KERNEL_SSE4__ is never set anywhere - the correct one is __KERNEL_SSE41__.
Because of that, the replacements were enabled for SSE4.1 and better (AVX) as well, where they're not needed and only slow things down.
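
The effect of the misspelled macro can be shown in isolation. In the sketch below the two #define lines simulate an SSE4.1 build, on the assumption that such builds define both flags; the constant names are hypothetical, and the guards are reconstructed from the description above rather than copied from the source.

```cpp
// Simulated SSE4.1 build configuration (assumption: both flags are defined).
#define __KERNEL_SSE3__
#define __KERNEL_SSE41__

// Old guard: __KERNEL_SSE4__ is never defined anywhere, so the second
// condition is always true and the replacements are always compiled in.
#if defined(__KERNEL_SSE3__) && !defined(__KERNEL_SSE4__)
constexpr bool old_guard_enables_replacements = true;
#else
constexpr bool old_guard_enables_replacements = false;
#endif

// Fixed guard: tests the macro that is actually set for SSE4.1 builds.
#if defined(__KERNEL_SSE3__) && !defined(__KERNEL_SSE41__)
constexpr bool fixed_guard_enables_replacements = true;
#else
constexpr bool fixed_guard_enables_replacements = false;
#endif
```

Under this configuration the old guard still enables the SSE3 replacements, while the fixed guard leaves them out.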

Revision dba99c4 by Lukas Stockner (soc-2016-cycles_denoising, temp-cycles-denoising)

August 9, 2016, 01:50 (GMT)

Cycles: Implement SSE3-optimized denoising kernel
