Comment 10 for bug 1840854

Matthew Ruffell (mruffell) wrote :

Hi Jens,

I highly recommend you go through the pain of upgrading the kernel on your GPU cluster to something modern, like 4.15.0-91-generic. There were quite a few regressions around the 4.15.0-56 to 4.15.0-58 mark, as we merged a lot of upstream stable patches at that time.

4.15.0-91 is pretty stable these days, and you can probably leave the cluster on that kernel long term.

In this bug, the fix landed in the mlx5_core driver, which is a kernel module. Kernel modules are only compatible with the kernel that they were compiled for, since Linux does not have a stable in-kernel ABI (binary interface) for modules.
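
If you want to see the mismatch for yourself, here is a rough sketch in Python that compares the "vermagic" string of a module against the running kernel. It assumes modinfo is installed and that the first token of vermagic is the kernel release the module was built for. The helper names are made up for illustration, and this is only a quick sanity check, not the full set of checks the kernel does when it loads a module.

```python
import platform
import subprocess
import sys


def vermagic_release(vermagic: str) -> str:
    """Return the kernel release a module was built for.

    Assumes the release is the first whitespace-separated token,
    e.g. "4.15.0-91-generic SMP mod_unload" -> "4.15.0-91-generic".
    """
    parts = vermagic.split()
    return parts[0] if parts else ""


def module_matches_kernel(vermagic: str, running_release: str) -> bool:
    """True if the module's build release equals the running kernel release."""
    return vermagic_release(vermagic) == running_release


def read_vermagic(module: str) -> str:
    """Ask modinfo for the vermagic field of a module name or .ko path."""
    result = subprocess.run(
        ["modinfo", "-F", "vermagic", module],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Pass a path to a .ko file to check a module copied from another kernel.
    target = sys.argv[1] if len(sys.argv) > 1 else "mlx5_core"
    running = platform.release()
    magic = read_vermagic(target)
    verdict = "matches" if module_matches_kernel(magic, running) else "does NOT match"
    print(f"{target}: built for {vermagic_release(magic)}, running {running}: {verdict}")
```

Pointing this at a .ko file taken from a different kernel package should report a mismatch, which is why a module built for 4.15.0-91 cannot simply be dropped onto an older kernel.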

So, this isn't as easy as just copying over a fixed kernel module. The kmod package doesn't actually have any kernel modules in it, just the blacklists and related configuration under /etc/modules-load.d and /etc/modprobe.d.

Nvidia drivers should be built with dkms, and *should* work without too much hassle. I know that theory doesn't always align with reality though.

Anyway, I recommend you upgrade to a newer kernel on your GPU cluster.

Thanks,
Matthew