August 21, 2016, 04:06 (GMT) |
Cycles: Use the correct bias and variance models for the least-squares fit and global bandwidth optimization

The approach that is used to find the global bandwidth is:
- Run the reconstruction filter for different bandwidths and estimate bias and variance
- Fit analytic bias and variance models to these bandwidth-bias/variance pairs using least squares
- Minimize the MSE term (Bias^2 + Variance) analytically using the fitted models

The models used in the LWR paper are:
- Bias(h) = a + b*h^2
- Variance(h) = (c + d*h^(-k))/n
where (a, b, c, d) are the parameters to be fitted, h is the global bandwidth, k is the rank and n is the number of samples. Classic linear least squares is used to find a, b, c and d.

Then, the paper states that MSE(h) = Bias(h)^2 + Variance(h) is minimal for h = (k*d / (4*b^2*n))^(1/(k+4)).

Now, what is suspicious about this term is that a and c don't appear. c makes sense - after all, its contribution to the variance is independent of h. a, however, does not - since the bias term is squared, a cross term that depends on both h and a survives differentiation.

It turns out that this minimization term is wrong for these models, but correct when using Bias(h) = b*h^2 (without the constant offset). That model also makes intuitive sense, since the bias should go to zero as the filter strength (bandwidth) does. Similarly, the variance model should go to zero as h goes towards infinity, since infinite filter strength would eliminate all noise.

Therefore, this commit changes the bias and variance models to no longer include the constant terms. The change in result can be significant - in my test scene, the average bandwidth halved. |
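The fit and the closed-form optimum described above can be sketched in a few lines. Assuming the constant-free models Bias(h) = b*h^2 and Variance(h) = d*h^(-k)/n (both constants dropped, as the commit describes), minimizing MSE(h) = b^2*h^4 + d*h^(-k)/n via d/dh MSE = 4*b^2*h^3 - k*d*h^(-k-1)/n = 0 yields exactly h = (k*d/(4*b^2*n))^(1/(k+4)). All names and the sample data are illustrative, not Cycles code:

```python
# Minimal sketch of the corrected global bandwidth optimization, assuming
# the constant-free models Bias(h) = b*h^2 and Variance(h) = d*h^(-k)/n.
# Hypothetical names, pure Python - not the actual Cycles implementation.

def fit_and_optimize(samples, k, n):
    """samples: list of (h, bias, variance) measured per candidate bandwidth."""
    # Least-squares fit of Bias(h) = b*h^2 (single parameter, closed form).
    b = sum(h**2 * bias for h, bias, _ in samples) / sum(h**4 for h, _, _ in samples)
    # Least-squares fit of Variance(h) = d*h^(-k)/n (single parameter).
    d = sum(h**-k * n * var for h, _, var in samples) / sum(h**(-2 * k) for h, _, _ in samples)
    # First-order condition of MSE(h) = b^2*h^4 + d*h^(-k)/n gives:
    h_opt = (k * d / (4.0 * b**2 * n)) ** (1.0 / (k + 4))
    return b, d, h_opt
```

With the constant offsets a and c restored, the derivative picks up additional terms in a and h, and this closed form no longer holds - which is exactly the inconsistency the commit removes.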
August 21, 2016, 04:06 (GMT) |
Cycles: Further improve CUDA denoising speed by redesigning the design_row

The previous algorithm was:
- Fetch buffer data into the feature vector, which was stored in shared (faster) memory
- Use the feature vector to calculate the weight and the design_row, which was stored in local (slower) memory
- Update the Gramian matrix using the design_row

Now, the problem there is that the most expensive part in terms of memory accesses is the third step, which means that having the design_row in shared memory would be a great improvement. However, shared memory is extremely limited - for good performance, the number of elements per thread should be odd (to avoid bank conflicts), but even going from the 11 floats that the feature vector needs to 13 already hurts occupancy significantly.

Therefore, in order to make room for the design_row, it would be great to get rid of the feature vector. That's the first part of this commit: by changing the order in which the design_row is built, the first two steps can be merged so that the design_row is constructed directly from the buffer data instead of going through the feature vector. This has a disadvantage - the old design_row construction had an early abort for zero weights, which was pretty common. With the new structure, that's no longer possible. However, this is less of a problem on GPUs due to divergence - in order to save any time, all 32 threads in the warp had to abort anyway.

Now the feature vector doesn't take up memory anymore, but the design_row is still too big - it has up to 23 elements, which is far too much. It has a useful property, though: the first element is always one, and the last 11 elements are just the squares of the preceding 11. So, storing 11 floats is enough to recover all the information, and the squaring can be performed whenever the design_row is used. Therefore, the second part of the commit adds specialized functions that accept this reduced design_row and account for the missing elements. |
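The reduced-storage trick can be illustrated as follows. This is a pure-Python sketch with hypothetical names (the real kernels operate on per-thread float arrays in CUDA): the full design row is the leading 1, the 11 features, and their squares, and the specialized functions consume only the 11 stored features, reconstructing the rest on the fly:

```python
# Illustration of the reduced design_row storage (hypothetical names, not
# the actual Cycles kernels). The full row has 2*F + 1 elements: a leading
# 1, the F features, and their F squares. Only the F features are stored.

def expand_design_row(features):
    """Rebuild the full design row from the stored features."""
    return [1.0] + list(features) + [f * f for f in features]

def design_row_dot(features, vec):
    """Dot product of the implicit full design row with a full-length
    vector, without ever materializing the full row."""
    assert len(vec) == 2 * len(features) + 1
    result = vec[0]  # contribution of the leading 1
    offset = 1 + len(features)
    for i, f in enumerate(features):
        result += f * vec[1 + i]            # linear part
        result += f * f * vec[offset + i]   # squared part
    return result
```

The Gramian update then only ever touches the 11 stored floats, which is what lets the reduced design_row fit into shared memory.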
August 21, 2016, 04:06 (GMT) |
Cycles: Fix various issues with the denoising debug passes |
August 21, 2016, 04:05 (GMT) |
Cycles: Fix undefined filter strength when using standalone denoising |
August 21, 2016, 04:05 (GMT) |
Cycles: Fix wrong sample variance variance calculation

The missed factor caused the NLM filtering of the buffer variance to essentially reduce to a simple box filter, which overblurred the buffer variance and therefore caused problems with sharp edges in the shadow buffer. |
August 21, 2016, 04:05 (GMT) |
Cycles: Fix wrong offset for feature matrix norm calculation |
August 21, 2016, 04:04 (GMT) |
Cycles: Revert to 6 bias-variance samples

The CUDA redesign commit removed the sample at h=2, but I found that this actually makes results worse. Therefore, it's now added back. |
August 21, 2016, 04:04 (GMT) |
Cycles: Write optional debug info for CUDA shadow prefiltering |
August 21, 2016, 04:04 (GMT) |
Cycles: Don't denoise the current tile if the user cancelled the render |
August 21, 2016, 04:04 (GMT) |
Cycles: Add debugging option to CUDA for switching between large L1 cache or large shared memory |
August 21, 2016, 04:04 (GMT) |
Cycles Denoising: Tweak shadow filtering |
August 21, 2016, 03:51 (GMT) |
Cycles: Redesign CUDA kernels to increase denoising performance

This commit contains essentially a complete overhaul of the CUDA denoising kernels. One of the main changes is splitting up the huge estimate_params kernel into multiple smaller ones:
- One kernel calculates the reduced feature space transform.
- One kernel estimates the feature bandwidths.
- One kernel estimates bias and variance for a given global bandwidth. This kernel is executed multiple times for different global bandwidths.
- One kernel calculates the optimal global bandwidth.

This improves UI responsiveness since the individual kernel launches are shorter. Also, smaller kernels are generally a good thing on GPUs - from register allocation to warp divergence.

The next major improvement concerns the transform - before this commit, transform loads from global memory were the main bottleneck. First of all, it's now stored in a SoA layout instead of AoS, which makes all transform loads coalesced. Furthermore, the transform pointer is declared as "float const* __restrict__" instead of "float*", which allows NVCC to cache the transform reads. Since only the first kernel writes the transforms, this increases speed again.

The third major change is that the feature vector, which is used in every per-pixel loop, is now stored in shared memory. Since the feature vector is involved in a lot of operations, this improves performance again. On the other hand, shared memory is rather limited on Kepler and older GPUs, so even the 11 floats per thread are already a lot. With the default "16KB shared - 48KB L1 cache" split on a GTX 780, occupancy is only 12.5% - way too low. With "48KB shared - 16KB L1 cache", occupancy is back up at 50%, but of course there are more cache misses - in the end, though, the benefits of having the feature vector in shared memory make up for that. I expect the performance boost to be even higher on Maxwell and Pascal, since these have much larger shared memory and L1 caches. |
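The AoS-to-SoA change can be sketched with the index math alone (a hypothetical illustration with made-up sizes; only the addressing pattern matters). With AoS, adjacent threads reading the same transform element touch addresses a full transform apart; with SoA they touch consecutive addresses, so the hardware can coalesce the loads into one transaction:

```python
# Sketch of the AoS -> SoA transform layout change (hypothetical sizes,
# not the actual Cycles element counts). Thread t processes pixel t; the
# interesting question is which addresses adjacent threads hit when they
# all read the same transform element in lockstep.

TRANSFORM_SIZE = 12  # illustrative per-pixel transform element count
NUM_PIXELS = 4       # illustrative number of pixels (threads)

def aos_index(pixel, element):
    """AoS: all elements of one pixel's transform are contiguous."""
    return pixel * TRANSFORM_SIZE + element

def soa_index(pixel, element, num_pixels):
    """SoA: element e of every pixel is contiguous across pixels."""
    return element * num_pixels + pixel
```

For element 0, threads 0..3 read AoS addresses 0, 12, 24, 36 (strided, uncoalesced) versus SoA addresses 0, 1, 2, 3 (coalesced), which is why the SoA layout removes the global-memory bottleneck described above.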
August 21, 2016, 03:51 (GMT) |
Cycles: Fix wrong stride in buffer accesses when the tile size plus overscan wasn't a multiple of 4 |
August 21, 2016, 03:50 (GMT) |
Cycles: Fix compilation with enabled filter debug output |
August 21, 2016, 03:38 (GMT) |
Merge remote-tracking branch 'origin/master' into soc-2016-cycles_denoising

This was an extremely hacky merge with a lot of rebasing and git tricks involved; I hope it works as it's supposed to. |
August 20, 2016, 23:04 (GMT) |
Removed chrono calls in Mantaflow file. They caused some trouble with the Blender build and are not needed anyway. |
August 19, 2016, 22:54 (GMT) |
Curves: GSoC 2016 - Bezier curve improvements

Added docstrings to all functions. |
August 19, 2016, 22:42 (GMT) |
WIP packing: concave support for iterative solution search |
August 19, 2016, 22:22 (GMT) |
fix for viewport switcher: don't hide it when cache is baked |
August 19, 2016, 21:31 (GMT) |
Merge branch 'master' into soc-2016-layer_manager

Conflicts:
	source/blender/blenkernel/intern/depsgraph.c
	source/blender/blenkernel/intern/object_dupli.c
	source/blender/blenloader/intern/versioning_270.c
	source/blender/modifiers/intern/MOD_cloth.c
	source/blender/modifiers/intern/MOD_dynamicpaint.c
	source/blender/modifiers/intern/MOD_smoke.c |
|
|
|


Master Commits
MiikaHweb | 2003-2021