I've finished some research on this, using one of the larger Pis (a 400) for performance measurements and a memory-limited Pi (Zero 2W) for an idea of the impact on memory usage.

First, the performance side of things: the good news is that it doesn't make anything worse; the bad news is that it doesn't make anything *much* better (unlike on the PC, where it apparently makes a substantial difference). The biggest difference on the Pi 400 was a drop from 16s to 13s in the cold startup time of the jammy release of Firefox. However, there was no difference at all in the warm startup time, and the difference in the cold startup time disappeared almost entirely when using the candidate version of Firefox with the lang-pack fix (the diff was 0.3s, which I'd assume is essentially nothing given that I was doing manual timing and the times are therefore subject to my reaction time of ~0.2s).

On the memory side of things, I used a selection of 6 snaps installed on the Pi Zero 2W: mosquitto, node-red, micropython, node, ruby, and lxd (a combination of relatively common and fairly IoT-specific snaps), which is probably about as much as anyone could reasonably expect to install and run on a half-gig system. I measured the system after a fresh boot, running only the default services and otherwise idle, on the Raspberry Pi jammy (22.04 LTS) arm64 server image. The arm64 architecture was selected partly for the wider availability of snaps (very few are built for armhf) and partly because memory effects were more likely to be pronounced on it. Over the course of 30 minutes, everything from /proc/meminfo was dumped to an SQLite database, first under the current release of the kernel (5.15.0-1011-raspi), then under a re-built version with CONFIG_SQUASHFS_DECOMP_MULTI_PERCPU enabled (and CONFIG_SQUASHFS_DECOMP_SINGLE disabled).
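The sampling loop was along these lines (a rough sketch, not the exact script: the database name, table layout, and 5-second interval are my own illustration — only the 30-minute duration comes from the run described above):

```python
import re
import sqlite3
import sys
import time

# Illustrative names/values, not taken verbatim from the measurement run.
DB_PATH = "meminfo.db"
INTERVAL_S = 5
DURATION_S = 30 * 60  # 30 minutes, as in the runs above

def read_meminfo():
    """Parse /proc/meminfo into a {field: value} mapping (kB for most fields)."""
    fields = {}
    with open("/proc/meminfo") as f:
        for line in f:
            # Handles plain names and ones like "Active(anon)".
            m = re.match(r"(\w+(?:\(\w+\))?):\s+(\d+)", line)
            if m:
                fields[m.group(1)] = int(m.group(2))
    return fields

def main():
    db = sqlite3.connect(DB_PATH)
    db.execute("CREATE TABLE IF NOT EXISTS samples "
               "(ts REAL, field TEXT, kb INTEGER)")
    end = time.monotonic() + DURATION_S
    while time.monotonic() < end:
        now = time.time()
        db.executemany("INSERT INTO samples VALUES (?, ?, ?)",
                       [(now, k, v) for k, v in read_meminfo().items()])
        db.commit()
        time.sleep(INTERVAL_S)
    db.close()

# Only start the 30-minute loop when explicitly asked, so importing
# this file has no side effects.
if __name__ == "__main__" and "--log" in sys.argv:
    main()
```

One such database per kernel (one run under SINGLE, one under MULTI_PERCPU) then allows the per-field medians to be compared offline.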
MemAvailable stayed reasonably stable across each run, and its median was ~12MB lower on the MULTI_PERCPU kernel than on the SINGLE kernel, suggesting that each snap's mount occupied somewhere in the region of 2MB more memory under the MULTI_PERCPU kernel. Other statistics were less clear-cut: there was no significant difference in kernel stack usage, and MemFree actually showed the opposite trend. That last result should probably be ignored, though: MemFree doesn't exclude evictable pages like the disk cache, and since the kernel has a duty to minimize it, it will typically fall predictably on a freshly booted system, meaning that differences in the start time of a measurement run lead to an irrelevant delta. The kernel Slab measure showed an interesting (stable, median) 6MB increase from SINGLE to MULTI_PERCPU, which may account for some of the extra memory being used.

Conclusions:

* There's some memory loss, but it's small enough that it shouldn't significantly impact even tightly constrained systems like the Zero 2W.
* The performance gains are likely minimal on ARM. As such, were ARM the only set of architectures being considered, I'd probably recommend against this.
* However, the performance gains on the x86 family are significant, so for me this comes down to a "lack of harm" judgment: if this could be enabled on amd64 and disabled on armhf+arm64, that would probably be the ideal situation. However, I'm mindful that any delta means extra maintenance work and an extra chance of errors. Given the mixed situation on ARM (minor performance gain, minor memory loss), I'd recommend simply enabling this option across the board.
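For anyone wanting to reproduce the comparison, the median deltas were computed along these lines (a sketch only: the database file names and the samples(ts, field, kb) schema are assumptions of mine, not details from the runs):

```python
import os
import sqlite3
import statistics

# Assumed layout: one SQLite database per kernel, each holding a
# "samples" table of (ts REAL, field TEXT, kb INTEGER) rows taken
# from /proc/meminfo every few seconds over a 30-minute run.
def median_kb(db_path, field):
    """Median value of one /proc/meminfo field across a run, in kB."""
    db = sqlite3.connect(db_path)
    rows = db.execute(
        "SELECT kb FROM samples WHERE field = ?", (field,)).fetchall()
    db.close()
    return statistics.median(r[0] for r in rows)

# File names are hypothetical; skip the report if the runs aren't present.
if os.path.exists("single.db") and os.path.exists("multi_percpu.db"):
    for field in ("MemAvailable", "Slab"):
        single = median_kb("single.db", field)
        multi = median_kb("multi_percpu.db", field)
        # Positive delta: the SINGLE kernel's median was higher.
        print(f"{field} median delta: {(single - multi) / 1024:+.1f} MB")
```

Comparing medians rather than means keeps a handful of transient spikes (e.g. cron or journald activity during a run) from skewing the comparison.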