Comment 9 for bug 1755073

Paolo Pisati (p-pisati) wrote :

Ok, here is what i found so far:

The problem lies in the 'thunderx_zip' driver, i.e. the driver for the hw-accelerated zip compressor/decompressor IP block: it kicks in once we select the deflate method for the zram device.
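To see which deflate implementations are registered on a given board, /proc/crypto can be inspected. The helper below is only a sketch: crypto_drivers is a made-up name, and it assumes zram picks its compressor through the kernel crypto API, so that the highest-priority "deflate" entry is the one that gets used. It prints driver, module and priority for every entry whose name matches.

```shell
#!/bin/sh
# crypto_drivers ALGO [FILE]
# Hypothetical helper: list "driver module priority" for every registered
# implementation of ALGO. FILE defaults to /proc/crypto and only exists as a
# parameter so the parsing can be tried on a saved copy.
crypto_drivers() {
    awk -v algo="$1" '
        $1 == "name"     { name = $3 }
        $1 == "driver"   { driver = $3 }
        $1 == "module"   { module = $3 }
        # priority comes after name/driver/module within each entry
        $1 == "priority" { if (name == algo) print driver, module, $3 }
    ' "${2:-/proc/crypto}"
}

# Example (on the affected board):
#   crypto_drivers deflate
```

If a line with module thunderx_zip shows up with a higher priority than the generic software implementation, that is presumably the one zram ends up using.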

How to reproduce it:

# modprobe zram
# echo 1 > /sys/block/zram0/reset
# echo deflate > /sys/block/zram0/comp_algorithm
# echo 128M > /sys/block/zram0/disksize
# mkfs.ext4 -F /dev/zram0
[stuck forever here]
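For scripted runs of the reproducer, the last step can be wrapped in a deadline so the hang is reported instead of blocking the whole script. This is a rough sketch: run_with_deadline is a made-up helper built on coreutils timeout, and the 60 second limit in the example is an arbitrary choice. Note that if mkfs is stuck in uninterruptible sleep, even SIGKILL will not end it and the wrapper will block as well.

```shell
#!/bin/sh
# run_with_deadline SECS CMD...
# Hypothetical helper: run CMD under coreutils timeout and report whether it
# finished in time. timeout exits with 124 when the deadline expires, or 137
# when it had to fall back to SIGKILL (sent 5s after SIGTERM via -k).
run_with_deadline() {
    secs=$1
    shift
    timeout -k 5 "$secs" "$@"
    rc=$?
    if [ "$rc" -eq 124 ] || [ "$rc" -eq 137 ]; then
        echo "HANG: '$*' did not finish within ${secs}s"
        return 1
    fi
    echo "OK: '$*' exited with status $rc"
    return "$rc"
}

# Example (as root, after the sysfs setup above):
#   run_with_deadline 60 mkfs.ext4 -F /dev/zram0
```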

Two trivial workarounds:

-blacklist the thunderx_zip kmod:

# rmmod thunderx_zip
# echo 'blacklist thunderx_zip' >> /etc/modprobe.d/blacklist.conf
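A quick way to confirm the workaround took effect is to check that the blacklist entry exists and that the module is no longer loaded. The two helpers below are a sketch with made-up names (is_blacklisted, is_loaded); the optional path arguments are there only so they can be pointed at test files.

```shell
#!/bin/sh
# is_blacklisted MODULE [DIR]
# Succeeds if any *.conf file in DIR (default /etc/modprobe.d) contains a
# "blacklist MODULE" line.
is_blacklisted() {
    grep -Eqs "^[[:space:]]*blacklist[[:space:]]+$1[[:space:]]*\$" \
        "${2:-/etc/modprobe.d}"/*.conf
}

# is_loaded MODULE [FILE]
# Succeeds if MODULE is listed in FILE (default /proc/modules), where each
# line starts with the module name followed by a space.
is_loaded() {
    grep -qs "^$1 " "${2:-/proc/modules}"
}

# Example:
#   is_blacklisted thunderx_zip && ! is_loaded thunderx_zip && echo "workaround active"
```

If the module still shows up as loaded after a reboot, it may be coming from the initramfs, in which case the initramfs probably needs to be regenerated so it picks up the new blacklist file.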

or

-disable the CRYPTO_DEV_CAVIUM_ZIP kconfig option (and avoid building the thunderx_zip kmod) and recompile

As to what is causing it: until yesterday it appeared that the problem was connected to the arm64 kpti patchset, but now i'm sure it is not:

the problem first appeared in 4.13.0-37-generic #42, while 4.13.0-36-generic #40 was immune, and the only difference between those two kernels is the arm64 kpti patchset.
You can easily reproduce it in the upstream stable/linux-4.14.y tree too (4.14.26 is affected for example).

$ make defconfig
$ echo "CONFIG_CRYPTO_DEV_CAVIUM_ZIP=m" >> .config
$ make oldconfig

build and install as usual, and then try the reproducer above.
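Since oldconfig may silently drop an appended symbol whose dependencies are not met, it can be worth checking the resulting .config before building. A minimal sketch, where config_state is a made-up helper name:

```shell
#!/bin/sh
# config_state SYMBOL [FILE]
# Hypothetical helper: print y or m if CONFIG_SYMBOL is enabled in FILE
# (default .config), or n if it is unset or absent. If the symbol appears
# more than once, the last assignment is reported.
config_state() {
    val=$(sed -n "s/^CONFIG_$1=\([ym]\)\$/\1/p" "${2:-.config}" | tail -n 1)
    echo "${val:-n}"
}

# Example (from the top of the kernel tree, after "make oldconfig"):
#   config_state CRYPTO_DEV_CAVIUM_ZIP
```

For the reproducer the expected answer is m; the same check should report n on a kernel built with the second workaround.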

But what i found this morning is that even the original 4.14.0 release is affected, and that release clearly doesn't contain the kpti patches.

Now what i want to try is:

1) test it on different hardware (one thing that i noticed is that the error changes slightly depending on whether the thunderx_zip kmod is loaded at boot or later in the board life cycle, and that smells a lot like memory corruption)
2) test it with 4.15.x and 4.16

I'll write another update when i have more data.