July 16, 2021, 12:37 (GMT) |
Cycles X: Support shadow catcher behind transparent object This turned out to be rather straightforward. Initially it seemed less obvious, until we started counting the number of samples for the shadow catcher pass. Note that this only covers the Transparent BSDF. The Glass BSDF cannot be supported because it refracts light, which cannot be stored in a shadow catcher pass. Differential Revision: https://developer.blender.org/D11946 |
July 16, 2021, 12:32 (GMT) |
Cycles X: Support Transparent Glass for shadow catcher Improves support of Glass BSDF in front of a shadow catcher. |
July 16, 2021, 11:44 (GMT) |
Fix fully transparent shadow catcher pass without catchers Makes the behavior of the shadow catcher pass always predictable: it is always possible to multiply it with a backdrop, regardless of whether a shadow catcher object is present in the scene. The downside is that extra memory is allocated to store an empty shadow catcher pass, and the denoiser will do extra work. This is possible to avoid, but it ends up in tricky checks, and the situation is unlikely to be common enough to justify making the code more complex. Differential Revision: https://developer.blender.org/D11945 |
July 15, 2021, 16:22 (GMT) |
Cycles X: Ignore shadow catcher from holdout collection Differential Revision: https://developer.blender.org/D11937 |
July 15, 2021, 16:19 (GMT) |
Fix wrong render result after cryptomatte commit Was checking the wrong field to see whether there are any cryptomatte passes in the scene. |
July 15, 2021, 15:15 (GMT) |
Cycles X: Bring back cryptomatte post-processing This is the non-accurate mode, used for both CPU and GPU, which runs as a post-processing pass after all samples have finished. It happens via the render scheduler, as it knows when path tracing has finished. Compared to regular Cycles, this makes the cryptomatte pass properly sorted with adaptive sampling enabled. The accurate CPU implementation, which used to be done via the Coverage class, is not yet hooked back up. This needs to happen either via the kernel or via the PathTraceWork. The current state of the patch should make it trivial to bring the accurate implementation back. This change also fixes missing denoising when using constant time rendering. Differential Revision: https://developer.blender.org/D11934 |
July 15, 2021, 15:00 (GMT) |
Cycles X: Implement path compaction for shadow catcher The demo file is BMW27 with the ground set as a shadow catcher. The observed performance improvement is about 5% on RTX 5000. The general idea is to schedule new tiles in a way that always leaves space for the shadow catcher. Roughly, we first schedule 50% of path states from the maximum number of paths, then 25% and so on. Summary of changes:
- Replace the constant offset of the shadow catcher state with an atomically incrementing index.
- Add a new kernel to count the number of states which can still split. Could experiment with some atomics so that a path split decreases a value, as does path termination, and the value is increased when new paths are added. Not sure this will give better performance.
- Remove the terminated paths kernel from scheduling. The paths are compacted, so we know they are at the beginning of the array.
Differential Revision: https://developer.blender.org/D11932 |
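The scheduling idea above can be sketched in plain C++. This is a minimal illustration, not the code from the patch: `num_paths_to_schedule` and `SplitStateAllocator` are hypothetical names, and the assumption is that every path which can still split needs exactly one spare state reserved for its shadow catcher copy.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper (not from the patch): number of new camera paths that can
// be scheduled while keeping one spare state for every path that may still split.
inline int num_paths_to_schedule(int max_num_states, int num_active, int num_can_split)
{
  assert(max_num_states >= 0 && num_active >= 0);
  assert(num_can_split >= 0 && num_can_split <= num_active);
  // States that are neither in use nor reserved for a pending split.
  const int num_free = max_num_states - num_active - num_can_split;
  if (num_free <= 0) {
    return 0;
  }
  // Every new path takes one state for itself and reserves one for its split.
  return num_free / 2;
}

// Hypothetical allocator: split states get the next index from an atomic
// counter instead of living at a constant offset from the main path state.
struct SplitStateAllocator {
  std::atomic<int> next_index;

  explicit SplitStateAllocator(int first_free_index) : next_index(first_free_index) {}

  // Returns a unique index even when called from many threads at once.
  int allocate()
  {
    return next_index.fetch_add(1, std::memory_order_relaxed);
  }
};
```

With 1024 states and nothing active, the helper schedules 512 paths (50%). If those paths are all still alive but can no longer split, the next call schedules 256 (25%), which matches the rough series described in the commit message.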
July 15, 2021, 14:59 (GMT) |
Cycles X: Tweak max number of states seen by tile scheduler This is required for shadow catchers so that the tile scheduler gives work which can fit into the number of allowed camera rays. Use a smaller value than the maximum number of states to prepare code for state compaction of re-scheduling for the shadow catcher. Interestingly, this has a positive effect on regular rendering here with RTX 5000:
```
                            new        cycles-x
bmw27.blend                 12.445     12.2104
classroom.blend             24.4949    24.4508
pabellon.blend              11.3019    11.4407
monster.blend               13.409     13.4491
barbershop_interior.blend   18.6601    18.8364
junkshop.blend              26.3212    27.051
pvt_flat.blend              22.7389    22.9345
```
For future development we might try to make the tile scheduler give smaller tiles with a smaller number of samples, and rely on the path work GPU to request as many tiles as fit into the path states. Need to be careful though, because there are downsides in terms of memory bandwidth to pass work tiles to the init_from kernels. |
July 15, 2021, 14:59 (GMT) |
Fix Cycles X adaptive sampling convergence check The optimization of atomics and reduction was wrong: the warp voting functions operate on the threads of a single warp, and the result of the vote has to be accumulated once for every warp. The thread index is measured within a block, not within a warp: a block can have a large (GPU-dependent) number of threads, while a warp has only 32 threads. Now the code does the voting and atomically adds to the result. This solves a possible too-early sampling stop on GPU, but because the old code could have finished too soon, the absolute render time may go up. This is one of the things which is a bit hard to see on a real file, but the same approach was giving wrong results during development of the shadow catcher occupancy improvement. The best visualization of the problem so far was to force `converged` to be always false and print the number of pixels and active pixels after running the kernel. Before this change the number of active pixels was much smaller than the number of pixels, now those values match. |
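The difference between per-warp and per-block accumulation can be illustrated with a CPU-side simulation in plain C++. This is only a sketch: the function names are hypothetical, the loops stand in for GPU threads, and the faulty variant is an assumption about how such a bug typically looks (accumulating only from thread 0 of a block), not a copy of the old kernel.

```cpp
#include <cassert>
#include <vector>

constexpr int kWarpSize = 32;

// Simulates a warp vote: number of non-converged pixels among the (up to) 32
// lanes of the warp that starts at pixel index `warp_start`.
inline int warp_vote(const std::vector<bool> &converged, int warp_start)
{
  const int num_pixels = static_cast<int>(converged.size());
  int votes = 0;
  for (int lane = 0; lane < kWarpSize && warp_start + lane < num_pixels; ++lane) {
    if (!converged[warp_start + lane]) {
      ++votes;
    }
  }
  return votes;
}

// Intended behavior: the vote result is accumulated once for every warp.
inline int count_active_per_warp(const std::vector<bool> &converged)
{
  const int num_pixels = static_cast<int>(converged.size());
  int num_active = 0;  // Stands in for the atomically updated counter.
  for (int warp_start = 0; warp_start < num_pixels; warp_start += kWarpSize) {
    num_active += warp_vote(converged, warp_start);
  }
  return num_active;
}

// Assumed faulty variant: only thread 0 of each block accumulates, so only the
// first warp of every block contributes to the counter.
inline int count_active_thread_zero_only(const std::vector<bool> &converged, int block_size)
{
  assert(block_size > 0 && block_size % kWarpSize == 0);
  const int num_pixels = static_cast<int>(converged.size());
  int num_active = 0;
  for (int block_start = 0; block_start < num_pixels; block_start += block_size) {
    num_active += warp_vote(converged, block_start);
  }
  return num_active;
}
```

With 256 non-converged pixels and a block size of 128, the per-warp count is 256 while the thread-0-only count is 64, which reproduces the symptom of the active pixel count being much smaller than the pixel count.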
July 15, 2021, 11:55 (GMT) |
Cycles X: restore estimation of kernel memory usage for host memory fallback This makes it so that we don't allocate scene memory on the device, only to then find out later it has to move back to the host. Integrator working memory is now allocated before loading the kernels and allocating scene memory. This way it is included in the estimated kernel memory usage, and makes it less likely to be moved to the host. Differential Revision: https://developer.blender.org/D11922 |
July 15, 2021, 09:28 (GMT) |
Cycles X: Tweaks to the multi-device balancing There are a few ideas behind this change:
- Base the balancing on equalizing the actual time devices spend rendering, rather than trying to estimate this via performance-per-unit-work. This gives a better estimate and convergence than the old calculation on pabellon.blend.
- Perform the first re-balancing based on accumulated statistics after a short period of time rather than after the first sample. This allows more accurate statistics to be accumulated.
- Perform re-balancing more often, even in headless renders, when the balance is not ideal yet.
There are some other changes, like performing rebalancing before path tracing. This way it seems to be easier to write logic in the scheduler. Headless render on RTX 5000 GPU and i9-11900k CPU:
```
                            new        cycles-x
bmw27.blend                 14.8814    20.0281
classroom.blend             30.025     26.9318
pabellon.blend              13.1679    12.6133
monster.blend               16.4408    16.3826
barbershop_interior.blend   22.83      19.9255
junkshop.blend              28.7097    27.2703
pvt_flat.blend              24.7341    21.8464
```
F12 render on the same configuration:
```
                            new        cycles-x
bmw27.blend                 13.5106    13.9074
classroom.blend             31.3891    31.7155
pabellon.blend              12.3674    49.053
monster.blend               14.4754    13.6263
barbershop_interior.blend   24.8804    23.999
junkshop.blend              29.1324    27.267
pvt_flat.blend              25.6206    22.6731
```
While this helps a lot for the pabellon file, other files seem to experience a slowdown. It is a bit hard to find a good balance between how often to perform device load rebalancing and how occupied to keep the devices. There is also some measurable deviation in the render times, depending on previous load and such. For example, pvt_flat.blend deviates between ~23 and ~27 seconds. This probably has something to do with the thermal profile and the fact that we allow balancing quickly and then schedule a big chunk of work to render. Not totally satisfied, but overall this seems to be a better heuristic. Differential Revision: https://developer.blender.org/D11897 |
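One simple way to equalize measured render time across devices is sketched below. It is an assumption for illustration, not the heuristic from the patch: `rebalance_weights` is a hypothetical function that scales each device's share of the work by its observed throughput (share divided by time) and renormalizes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: given the fraction of work each device received and the
// time it spent rendering that work, return new fractions that would make all
// devices finish at the same time if their throughput stays constant.
inline std::vector<double> rebalance_weights(const std::vector<double> &weights,
                                             const std::vector<double> &times)
{
  assert(!weights.empty() && weights.size() == times.size());

  std::vector<double> result(weights.size());
  double total_throughput = 0.0;
  for (std::size_t i = 0; i < weights.size(); ++i) {
    assert(weights[i] >= 0.0 && times[i] > 0.0);
    // Work done per unit of time on this device.
    result[i] = weights[i] / times[i];
    total_throughput += result[i];
  }
  assert(total_throughput > 0.0);

  // Normalize so that the new weights sum to one.
  for (double &weight : result) {
    weight /= total_throughput;
  }
  return result;
}
```

For two devices that each received half of the work and took 1 and 3 seconds respectively, this yields shares of 0.75 and 0.25. A real scheduler would likely also need to decide how often to rebalance and how to smooth noisy timings, which is the trade-off the commit message discusses.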
July 15, 2021, 08:05 (GMT) |
Cleanup: Cycles X compilation warnings |
July 14, 2021, 16:47 (GMT) |
Fix error loading non-existent shadow kernel pass after recent changes |
July 14, 2021, 15:29 (GMT) |
Cycles X: make OptiX 7.3 the minimum required SDK version This ensures the new, faster builtin curve intersection is used, and lets us simplify the code a bit. Differential Revision: https://developer.blender.org/D11866 |
July 14, 2021, 15:24 (GMT) |
July 14, 2021, 15:23 (GMT) |
Cycles X: reduce GPU state memory usage when some features are not enabled In particular: volumes, subsurface, denoising and light passes. In a scene without these features, we go from 538MB to 346MB for the state memory usage. This also improves performance, presumably due to reduced memory traffic. Differential Revision: https://developer.blender.org/D11915 |
July 14, 2021, 15:23 (GMT) |
Cycles X: change requested device features to bitflags So that they can be shared between host and device. Differential Revision: https://developer.blender.org/D11914 |
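A minimal sketch of what feature bitflags shared between host and device code can look like. The enumerator names below are hypothetical and chosen to mirror the features named in the neighboring commits (volumes, subsurface, denoising, light passes); they are not taken from the Cycles source.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical feature flags: one bit per feature, so that a whole feature set
// fits into a single integer that is trivial to pass to a kernel.
enum DeviceFeature : uint32_t {
  FEATURE_VOLUME = 1u << 0,
  FEATURE_SUBSURFACE = 1u << 1,
  FEATURE_DENOISING = 1u << 2,
  FEATURE_LIGHT_PASSES = 1u << 3,
};

// Returns true when `feature` is part of the requested feature mask.
inline bool has_feature(uint32_t requested_features, DeviceFeature feature)
{
  assert(feature != 0);
  return (requested_features & feature) != 0;
}
```

A scene with volumes and denoising would request `FEATURE_VOLUME | FEATURE_DENOISING`, and code on either side can test individual bits with `has_feature`, for example to skip allocating state that belongs to a disabled feature.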
July 14, 2021, 15:23 (GMT) |
Cycles X: use less memory for float3 integrator state on GPU Allocate a different device_only_memory size depending on whether the device is a CPU or a GPU, since for GPU we don't align to 16 bytes for SSE. Also adds some sanity checks and ensures float3 is not used in device_vector, since it is incompatible for sharing data between CPU and GPU. Differential Revision: https://developer.blender.org/D11913 |
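The size difference can be shown with two illustrative struct layouts. The type and function names here are hypothetical stand-ins, not the Cycles types: a packed three-float struct occupies 12 bytes, while a 16-byte aligned one (as needed for SSE) occupies 16.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical layouts for illustration only.
struct PackedFloat3 {
  float x, y, z;
};

struct alignas(16) AlignedFloat3 {
  float x, y, z;
};

static_assert(sizeof(PackedFloat3) == 12, "three 4-byte floats without padding");
static_assert(sizeof(AlignedFloat3) == 16, "padded to the 16-byte alignment");

// Bytes needed to store one float3 member for `num_states` integrator states.
inline std::size_t float3_state_bytes(std::size_t num_states, bool is_gpu)
{
  // Guard against overflow of the multiplication below.
  assert(num_states <= static_cast<std::size_t>(-1) / sizeof(AlignedFloat3));
  const std::size_t element_size = is_gpu ? sizeof(PackedFloat3) : sizeof(AlignedFloat3);
  return num_states * element_size;
}
```

Under these assumptions the packed layout saves a quarter of the memory for every float3 member of the state. It also shows why the two layouts cannot share a buffer: element `i` sits at byte offset `12 * i` in one and `16 * i` in the other.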
July 14, 2021, 15:23 (GMT) |
Cleanup: remove disabled OpenCL implementation To be replaced with something else for non-NVIDIA devices later. Makes it easier to do some of the upcoming changes. Differential Revision: https://developer.blender.org/D11912 |
July 14, 2021, 14:28 (GMT) |
Merge branch 'master' into cycles-x |
|