ubuntu_zram_smoke test will cause soft lockup on Artful ThunderX ARM64

Bug #1755073 reported by Po-Hsu Lin on 2018-03-12
This bug affects 1 person
Affects          Importance   Assigned to
linux (Ubuntu)   Undecided    Paolo Pisati
Artful           Undecided    Paolo Pisati
Bionic           Undecided    Paolo Pisati

Bug Description

== SRU Justification ==

Enabling virtually mapped kernel stacks (CONFIG_VMAP_STACK) breaks the thunderx_zip driver. On compression or decompression, the executing CPU hangs in an endless loop. The reason is that the driver calls __pa() on a stack address, and __pa() does not work for an address that is not part of the 1:1 (linear) mapping.

== Fix ==

The patch in comment #12 fixes it.
To reproduce the bug (and test the new kernel), see comment #9.

== Regression Potential ==

The patch affects only the thunderx_zip driver (that is already in a completely broken state), so the regression potential is low.

---

This is a POTENTIAL REGRESSION.

This test passed with 4.13.0-36.40, but fails with 4.13.0-37.42.

The test gets stuck, with the following error messages in dmesg:
ubuntu@starmie-kernel:~$ dmesg
[ 470.227210] zram: Added device: zram0
[ 472.262544] zram0: detected capacity change from 0 to 134217728
[ 472.396960] EXT4-fs (zram0): mounted filesystem with ordered data mode. Opts: (null)
[ 475.761947] zram0: detected capacity change from 134217728 to 0
[ 476.796641] zram0: detected capacity change from 0 to 134217728
[ 476.909118] EXT4-fs (zram0): mounted filesystem with ordered data mode. Opts: (null)
[ 480.233817] zram0: detected capacity change from 134217728 to 0
[ 481.239001] zram0: detected capacity change from 0 to 134217728
[ 508.079684] watchdog: BUG: soft lockup - CPU#8 stuck for 23s! [mkfs.ext4:2253]
[ 508.086994] Modules linked in: lz4 lz4_compress zram nls_iso8859_1 thunderx_edac i2c_thunderx thunderx_zip i2c_smbus shpchp cavium_rng_vf cavium_rng gpio_keys ipmi_ssif uio_pdrv_genirq uio ipmi_devintf ipmi_msghandler ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear nicvf nicpf ast i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect aes_ce_blk sysimgblt aes_ce_cipher fb_sys_fops crc32_ce crct10dif_ce drm ghash_ce sha2_ce sha1_ce ahci libahci thunder_bgx thunder_xcv mdio_thunder thunderx_mmc mdio_cavium aes_neon_bs aes_neon_blk crypto_simd cryptd
[ 508.087138] CPU: 8 PID: 2253 Comm: mkfs.ext4 Not tainted 4.13.0-37-generic #42-Ubuntu
[ 508.087141] Hardware name: Cavium ThunderX CRB/To be filled by O.E.M., BIOS 5.11 12/12/2012
[ 508.087144] task: ffff801f57a86900 task.stack: ffff000023ec0000
[ 508.087158] PC is at zip_deflate+0x184/0x2e8 [thunderx_zip]
[ 508.087164] LR is at zip_deflate+0x17c/0x2e8 [thunderx_zip]
[ 508.087167] pc : [<ffff000001a6961c>] lr : [<ffff000001a69614>] pstate: 20400145
[ 508.087169] sp : ffff000023ec3580
[ 508.087171] x29: ffff000023ec3580 x28: 0000000000000000
[ 508.087177] x27: ffff801f57a86900 x26: 000000010000b104
[ 508.087182] x25: ffff801f5618c000 x24: ffff801f6a15bc18
[ 508.087188] x23: 0000000000001000 x22: ffff000023ec3688
[ 508.087193] x21: ffff801f5a2d4880 x20: ffff801f6a15bc18
[ 508.087199] x19: ffff000023ec3608 x18: 0000000000000000
[ 508.087204] x17: 0000000000000001 x16: 0000000000000000
[ 508.087209] x15: 0000aaaabd4048db x14: 00000000fa130000
[ 508.087215] x13: 0000100000000000 x12: 00000000fa120000
[ 508.087220] x11: 0000000000000000 x10: 0000000000000000
[ 508.087225] x9 : 0000000000000000 x8 : 0000000000000000
[ 508.087230] x7 : 0000000000000000 x6 : ffff8000101d6080
[ 508.087236] x5 : ffff000001a69058 x4 : 0000000000000008
[ 508.087241] x3 : 0000000000000000 x2 : 0000000000000000
[ 508.087246] x1 : ffff801f6a15be08 x0 : 0000000000000000
[ 508.087252] Call trace:
[ 508.087256] Exception stack(0xffff000023ec3440 to 0xffff000023ec3580)
[ 508.087260] 3440: 0000000000000000 ffff801f6a15be08 0000000000000000 0000000000000000
[ 508.087264] 3460: 0000000000000008 ffff000001a69058 ffff8000101d6080 0000000000000000
[ 508.087268] 3480: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 508.087272] 34a0: 00000000fa120000 0000100000000000 00000000fa130000 0000aaaabd4048db
[ 508.087276] 34c0: 0000000000000000 0000000000000001 0000000000000000 ffff000023ec3608
[ 508.087280] 34e0: ffff801f6a15bc18 ffff801f5a2d4880 ffff000023ec3688 0000000000001000
[ 508.087284] 3500: ffff801f6a15bc18 ffff801f5618c000 000000010000b104 ffff801f57a86900
[ 508.087288] 3520: 0000000000000000 ffff000023ec3580 ffff000001a69614 ffff000023ec3580
[ 508.087292] 3540: ffff000001a6961c 0000000020400145 ffff801f5a2d4880 ffff000023ec3688
[ 508.087297] 3560: 0000ffffffffffff ffff801f6a15bc18 ffff000023ec3580 ffff000001a6961c
[ 508.087304] [<ffff000001a6961c>] zip_deflate+0x184/0x2e8 [thunderx_zip]
[ 508.087311] [<ffff000001a68db0>] zip_compress+0xd8/0x158 [thunderx_zip]
[ 508.087317] [<ffff000001a690a0>] zip_comp_compress+0x48/0x60 [thunderx_zip]
[ 508.087325] [<ffff0000084b1280>] crypto_compress+0x50/0x68
[ 508.087336] [<ffff000001c7a298>] zcomp_compress+0x48/0x58 [zram]
[ 508.087343] [<ffff000001c7bb1c>] zram_bvec_rw.isra.18+0x21c/0x660 [zram]
[ 508.087350] [<ffff000001c7c184>] zram_make_request+0x13c/0x360 [zram]
[ 508.087356] [<ffff0000084dea94>] generic_make_request+0xf4/0x290
[ 508.087360] [<ffff0000084dec8c>] submit_bio+0x5c/0x198
[ 508.087367] [<ffff000008311944>] submit_bh_wbc+0x14c/0x1a0
[ 508.087371] [<ffff000008311bb4>] __block_write_full_page+0x21c/0x3b0
[ 508.087375] [<ffff000008311f4c>] block_write_full_page+0x10c/0x120
[ 508.087379] [<ffff0000083155e0>] blkdev_writepage+0x30/0x40
[ 508.087383] [<ffff000008237520>] __writepage+0x38/0x88
[ 508.087386] [<ffff000008237ddc>] write_cache_pages+0x1c4/0x490
[ 508.087389] [<ffff000008239a34>] generic_writepages+0x64/0xa0
[ 508.087392] [<ffff000008315550>] blkdev_writepages+0x38/0x60
[ 508.087395] [<ffff00000823a644>] do_writepages+0x5c/0x108
[ 508.087400] [<ffff000008229d9c>] __filemap_fdatawrite_range+0xec/0x128
[ 508.087404] [<ffff00000822a5a8>] file_write_and_wait_range+0x58/0xe8
[ 508.087407] [<ffff0000083147e4>] blkdev_fsync+0x3c/0x70
[ 508.087410] [<ffff00000830b77c>] vfs_fsync_range+0x64/0xc0
[ 508.087413] [<ffff00000830b860>] do_fsync+0x48/0x78
[ 508.087417] [<ffff00000830bb24>] SyS_fsync+0x24/0x38
[ 508.087419] Exception stack(0xffff000023ec3ec0 to 0xffff000023ec4000)
[ 508.087423] 3ec0: 0000000000000008 0000000000000000 0000000000001000 0000000000001000
[ 508.087427] 3ee0: 0000000000000000 0000aaaabd4048d4 0000000000000000 000000001862beef
[ 508.087431] 3f00: 0000000000000052 00000000000003f8 0000000000000400 0000000000000800
[ 508.087435] 3f20: 0000000000000c00 0000000000001000 0000000000001c00 0000aaaabd4048db
[ 508.087439] 3f40: 0000ffffa72c4c98 0000ffffa714fb58 0000000000000000 0000000000000000
[ 508.087443] 3f60: 0000aaaabd4001d0 0000000000000000 0000000000000000 0000aaaabd403f70
[ 508.087447] 3f80: 0000ffffecf836c0 0000aaaabd4044e0 0000aaaabd403f70 0000000000000000
[ 508.087451] 3fa0: 0000aaaabd403f70 0000ffffecf834b0 0000ffffa72aa0a4 0000ffffecf834b0
[ 508.087455] 3fc0: 0000ffffa714fb7c 0000000060000000 0000000000000008 0000000000000052
[ 508.087458] 3fe0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 508.087463] [<ffff000008083c00>] el0_svc_naked+0x34/0x38
[ 528.074307] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [kworker/u96:3:512]
[ 528.081877] Modules linked in: lz4 lz4_compress zram nls_iso8859_1 thunderx_edac i2c_thunderx thunderx_zip i2c_smbus shpchp cavium_rng_vf cavium_rng gpio_keys ipmi_ssif uio_pdrv_genirq uio ipmi_devintf ipmi_msghandler ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear nicvf nicpf ast i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect aes_ce_blk sysimgblt aes_ce_cipher fb_sys_fops crc32_ce crct10dif_ce drm ghash_ce sha2_ce sha1_ce ahci libahci thunder_bgx thunder_xcv mdio_thunder thunderx_mmc mdio_cavium aes_neon_bs aes_neon_blk crypto_simd cryptd
[ 528.082019] CPU: 2 PID: 512 Comm: kworker/u96:3 Tainted: G L 4.13.0-37-generic #42-Ubuntu
[ 528.082021] Hardware name: Cavium ThunderX CRB/To be filled by O.E.M., BIOS 5.11 12/12/2012
[ 528.082033] Workqueue: writeback wb_workfn (flush-252:0)
[ 528.082039] task: ffff801f27b58000 task.stack: ffff000013b00000
[ 528.082052] PC is at zip_deflate+0x180/0x2e8 [thunderx_zip]
[ 528.082059] LR is at zip_deflate+0x17c/0x2e8 [thunderx_zip]
[ 528.082062] pc : [<ffff000001a69618>] lr : [<ffff000001a69614>] pstate: 20400145
[ 528.082064] sp : ffff000013b03330
[ 528.082066] x29: ffff000013b03330 x28: 0000000000000000
[ 528.082071] x27: ffff801f27b58000 x26: 000000010000c700
[ 528.082077] x25: ffff801f57e04000 x24: ffff801f6a15bc18
[ 528.082082] x23: 0000000000001000 x22: ffff000013b03438
[ 528.082087] x21: ffff801f5a2d5880 x20: ffff801f6a15bc18
[ 528.082093] x19: ffff000013b033b8 x18: 0000000000000009
[ 528.082098] x17: 0000000000000002 x16: 0000000000000000
[ 528.082103] x15: 0000000000000001 x14: 0000000000db0000
[ 528.082108] x13: 0000100000000000 x12: 0000000000da0000
[ 528.082113] x11: 0000000000000000 x10: 0000000000000000
[ 528.082119] x9 : 0000000000000000 x8 : 0000000000000000
[ 528.082124] x7 : 0000000000000000 x6 : ffff8000101d6100
[ 528.082129] x5 : ffff000001a69058 x4 : 0000000000000008
[ 528.082134] x3 : 0000000000000000 x2 : 0000000000000000
[ 528.082139] x1 : ffff801f6a15be08 x0 : 0000000000000000
[ 528.082145] Call trace:
[ 528.082149] Exception stack(0xffff000013b031f0 to 0xffff000013b03330)
[ 528.082152] 31e0: 0000000000000000 ffff801f6a15be08
[ 528.082156] 3200: 0000000000000000 0000000000000000 0000000000000008 ffff000001a69058
[ 528.082160] 3220: ffff8000101d6100 0000000000000000 0000000000000000 0000000000000000
[ 528.082164] 3240: 0000000000000000 0000000000000000 0000000000da0000 0000100000000000
[ 528.082167] 3260: 0000000000db0000 0000000000000001 0000000000000000 0000000000000002
[ 528.082171] 3280: 0000000000000009 ffff000013b033b8 ffff801f6a15bc18 ffff801f5a2d5880
[ 528.082175] 32a0: ffff000013b03438 0000000000001000 ffff801f6a15bc18 ffff801f57e04000
[ 528.082179] 32c0: 000000010000c700 ffff801f27b58000 0000000000000000 ffff000013b03330
[ 528.082183] 32e0: ffff000001a69614 ffff000013b03330 ffff000001a69618 0000000020400145
[ 528.082187] 3300: ffff801f5a2d5880 ffff000013b03438 ffffffffffffffff ffff801f6a15bc18
[ 528.082190] 3320: ffff000013b03330 ffff000001a69618
[ 528.082197] [<ffff000001a69618>] zip_deflate+0x180/0x2e8 [thunderx_zip]
[ 528.082204] [<ffff000001a68db0>] zip_compress+0xd8/0x158 [thunderx_zip]
[ 528.082210] [<ffff000001a690a0>] zip_comp_compress+0x48/0x60 [thunderx_zip]
[ 528.082216] [<ffff0000084b1280>] crypto_compress+0x50/0x68
[ 528.082228] [<ffff000001c7a298>] zcomp_compress+0x48/0x58 [zram]
[ 528.082236] [<ffff000001c7bb1c>] zram_bvec_rw.isra.18+0x21c/0x660 [zram]
[ 528.082242] [<ffff000001c7c184>] zram_make_request+0x13c/0x360 [zram]
[ 528.082248] [<ffff0000084dea94>] generic_make_request+0xf4/0x290
[ 528.082252] [<ffff0000084dec8c>] submit_bio+0x5c/0x198
[ 528.082258] [<ffff000008311944>] submit_bh_wbc+0x14c/0x1a0
[ 528.082262] [<ffff000008311bb4>] __block_write_full_page+0x21c/0x3b0
[ 528.082266] [<ffff000008311f4c>] block_write_full_page+0x10c/0x120
[ 528.082270] [<ffff0000083155e0>] blkdev_writepage+0x30/0x40
[ 528.082274] [<ffff000008237520>] __writepage+0x38/0x88
[ 528.082277] [<ffff000008237ddc>] write_cache_pages+0x1c4/0x490
[ 528.082280] [<ffff000008239a34>] generic_writepages+0x64/0xa0
[ 528.082283] [<ffff000008315550>] blkdev_writepages+0x38/0x60
[ 528.082286] [<ffff00000823a644>] do_writepages+0x5c/0x108
[ 528.082290] [<ffff000008305938>] __writeback_single_inode+0x48/0x3f0
[ 528.082293] [<ffff000008306158>] writeback_sb_inodes+0x1c0/0x460
[ 528.082296] [<ffff000008306470>] __writeback_inodes_wb+0x78/0xc8
[ 528.082299] [<ffff0000083067bc>] wb_writeback+0x224/0x350
[ 528.082302] [<ffff000008307254>] wb_workfn+0x1c4/0x400
[ 528.082308] [<ffff0000080fe5e8>] process_one_work+0x1e0/0x420
[ 528.082312] [<ffff0000080fe874>] worker_thread+0x4c/0x478
[ 528.082316] [<ffff000008105848>] kthread+0x138/0x140
[ 528.082321] [<ffff0000080854f0>] ret_from_fork+0x10/0x18
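
A long trace like the one above is easier to triage once the watchdog reports are pulled out on their own. The following is a minimal sketch, not part of the test suite; the function name `summarize_soft_lockups` is hypothetical, and it assumes the "watchdog: BUG: soft lockup" message format shown in this log.

```shell
#!/usr/bin/env bash
# Summarize soft-lockup reports from dmesg-style text read on stdin.
# Prints one line per report: "<cpu> <seconds> <task>:<pid>".
# (Hypothetical helper; assumes the message format seen in this report.)
summarize_soft_lockups() {
    sed -n 's/.*BUG: soft lockup - CPU#\([0-9]*\) stuck for \([0-9]*\)s! \[\(.*\)]$/\1 \2 \3/p'
}

# Example use on a live system (not executed here):
#   dmesg | summarize_soft_lockups
```

Each output line gives the CPU number, how long it was stuck, and the task that was running on it.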

ProblemType: Bug
DistroRelease: Ubuntu 17.10
Package: linux-image-4.13.0-37-generic 4.13.0-37.42
ProcVersionSignature: User Name 4.13.0-37.42-generic 4.13.13
Uname: Linux 4.13.0-37-generic aarch64
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Mar 12 03:13 seq
 crw-rw---- 1 root audio 116, 33 Mar 12 03:13 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay': 'aplay'
ApportVersion: 2.20.7-0ubuntu3.7
Architecture: arm64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord': 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
Date: Mon Mar 12 03:22:24 2018
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig': 'iwconfig'
MachineType: Cavium ThunderX CRB
PciMultimedia:

ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=C.UTF-8
 SHELL=/bin/bash
ProcFB:
 0 EFI VGA
 1 astdrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.13.0-37-generic root=UUID=c5160401-c147-4329-9683-b946ce7d0d37 ro
RelatedPackageVersions:
 linux-restricted-modules-4.13.0-37-generic N/A
 linux-backports-modules-4.13.0-37-generic N/A
 linux-firmware 1.169.3
RfKill: Error: [Errno 2] No such file or directory: 'rfkill': 'rfkill'
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 12/12/2012
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 5.11
dmi.board.asset.tag: To be filled by O.E.M.
dmi.board.name: To be filled by O.E.M.
dmi.board.vendor: To be filled by O.E.M.
dmi.board.version: To be filled by O.E.M.
dmi.chassis.asset.tag: To be filled by O.E.M.
dmi.chassis.type: 0
dmi.chassis.vendor: Cavium
dmi.chassis.version: To be filled by O.E.M.
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr5.11:bd12/12/2012:svnCavium:pnThunderXCRB:pvrTobefilledbyO.E.M.:rvnTobefilledbyO.E.M.:rnTobefilledbyO.E.M.:rvrTobefilledbyO.E.M.:cvnCavium:ct0:cvrTobefilledbyO.E.M.:
dmi.product.family: Default string
dmi.product.name: ThunderX CRB
dmi.product.version: To be filled by O.E.M.
dmi.sys.vendor: Cavium


Po-Hsu Lin (cypressyew) wrote :

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Po-Hsu Lin (cypressyew) on 2018-03-12
summary: - ubuntu_zram_smoke test will cause soft lockup with Artful ARM64 kernel
+ ubuntu_zram_smoke test will cause soft lockup on Artful ARM64 kernel
Changed in linux (Ubuntu):
assignee: nobody → Colin Ian King (colin-king)

How do I reproduce this?

Changed in linux (Ubuntu):
assignee: Colin Ian King (colin-king) → Paolo Pisati (p-pisati)
Colin Ian King (colin-king) wrote :

git clone git://kernel.ubuntu.com/ubuntu/autotest
git clone git://kernel.ubuntu.com/ubuntu/autotest-client-tests
cd autotest/client
rmdir tests
ln -s ../../autotest-client-tests tests
cd ../..
sudo autotest/client/autotest-local autotest/client/tests/ubuntu_zram_smoke_test/control

Paolo Pisati (p-pisati) wrote :

It's an upstream bug: I can reproduce it on lundmark (ThunderX) with 4.14.26 (the latest 4.14 stable release) using the artful .config (run make oldconfig and accept all default values; I'll attach the config below).

It appears to show up only on ThunderX SoCs (the same test on a HiSilicon D05 worked fine), and the toolchain doesn't matter: it shows up whether the kernel was compiled with the artful or the xenial toolchain.

I'm reducing the scope before reporting it upstream.

ubuntu@lundmark:~$ uname -a
Linux lundmark 4.14.26+ #3 SMP Mon Mar 12 15:48:47 UTC 2018 aarch64 aarch64 aarch64 GNU/Linux

ubuntu@lundmark:~$ sudo ./zram_smoke_test.sh
PASSED: zram module loaded
PASSED: got 48 compression streams
PASSED: got lzo lz4 deflate lz4hc 842 compression algorithmns
PASSED: mkfs.ext4 on ZRAM successful
PASSED: ext4 on ZRAM (lzo), memory used: 41945534 4497
PASSED: mkfs.ext4 on ZRAM successful
PASSED: ext4 on ZRAM (lz4), memory used: 41945256 4006

[from the serial console]
dmesg:
...
[ 289.708789] zram: Added device: zram0
[ 291.743271] zram0: detected capacity change from 0 to 134217728
[ 291.909193] EXT4-fs (zram0): mounted filesystem with ordered data mode. Opts: (null)
[ 295.177609] zram0: detected capacity change from 134217728 to 0
[ 296.291895] zram0: detected capacity change from 0 to 134217728
[ 296.394257] EXT4-fs (zram0): mounted filesystem with ordered data mode. Opts: (null)
[ 299.602110] zram0: detected capacity change from 134217728 to 0
[ 300.606078] zram0: detected capacity change from 0 to 134217728
[ 327.694207] watchdog: BUG: soft lockup - CPU#25 stuck for 23s! [mkfs.ext4:2016]
[ 327.701593] Modules linked in: lz4 lz4_compress zram nicvf nicpf nls_iso8859_1 ast ttm thunder_bgx drm_kms_helper thunder_xcv thunderx_mmc mdio_thunder mdio_cavium drm fb_sys_fops syscopyarea sysfillrect sysimgblt i2c_thunderx i2c_smbus i2c_algo_bit thunderx_edac thunderx_zip cavium_rng_vf shpchp cavium_rng aes_ce_blk crypto_simd cryptd aes_ce_cipher crc32_ce crct10dif_ce ghash_ce ipmi_ssif aes_arm64 ipmi_devintf gpio_keys sha2_ce sha256_arm64 sha1_ce ipmi_msghandler uio_pdrv_genirq uio ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4
[ 327.701693] CPU: 25 PID: 2016 Comm: mkfs.ext4 Not tainted 4.14.26+ #3
[ 327.701695] Hardware name: Cavium ThunderX CRB/To be filled by O.E.M., BIOS 5.11 12/12/2012
[ 327.701698] task: ffff801f56dc0f00 task.stack: ffff000012768000
[ 327.701711] PC is at zip_deflate+0x180/0x2e8 [thunderx_zip]
[ 327.701717] LR is at zip_deflate+0x17c/0x2e8 [thunderx_zip]
[ 327.701720] pc : [<ffff000002031618>] lr : [<ffff000002031614>] pstate: 20400145
[ 327.701721] sp : ffff00001276b570
[ 327.701723] x29: ffff00001276b570 x28: 0000000000000000
[ 327.701728] x27: ffff801f27f2ec00 x26: 0000000000000000
[ 327.701733] x25: ffff801f58b5c000 x24: ffff801f6f72ac18
[ 327.701738] x23: 0000000000001000 x22: ffff00001276b678
[ 327.701743] x21: ffff801f59214880 x20: ffff801f6f72ac18
[ 327.701748] x19: ffff00001276b5f8 x18: 0000000000000000
[ 327.701752] x17: 0000000000000001 x16: 0000000000000000
[ 327.701757] x15: 0000aaab222018db x14: 00...

Paolo Pisati (p-pisati) wrote :
Po-Hsu Lin (cypressyew) wrote :

One thing to note: this issue can be reproduced with the Xenial HWE kernel on ThunderX systems as well. (It seems OK on a Moonshot system.)

Po-Hsu Lin (cypressyew) wrote :

Tested on a Moonshot system: the ubuntu_zram_smoke test passes there with the Artful kernel.

So this matches ppisati's finding in comment #5

summary: - ubuntu_zram_smoke test will cause soft lockup on Artful ARM64 kernel
+ ubuntu_zram_smoke test will cause soft lockup on Artful ThunderX ARM64
Paolo Pisati (p-pisati) wrote :

OK, here is what I have found so far:

The problem lies in the 'thunderx_zip' driver, which is the driver for the hardware-accelerated zip compressor/decompressor IP block. It kicks in once we select the deflate method for the zram device.

How to reproduce it:

# modprobe zram
# echo 1 > /sys/block/zram0/reset
# echo deflate > /sys/block/zram0/comp_algorithm
# echo 128M > /sys/block/zram0/disksize
# mkfs.ext4 -F /dev/zram0
[stuck forever here]
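
Before running mkfs, it can help to confirm that deflate really is the active algorithm. zram marks the selected algorithm in `comp_algorithm` with square brackets (for example `lzo lz4 [deflate]`). The sketch below parses that marker; the function names are hypothetical helpers, not part of zram or the test suite.

```shell
#!/usr/bin/env bash
# Print the algorithm shown in square brackets,
# e.g. "lzo lz4 [deflate] lz4hc 842" -> "deflate".
# Prints nothing if no algorithm is marked.
selected_algorithm() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p' <<<"$1"
}

# Read the selection for a zram device; the path defaults to zram0.
zram_selected_algorithm() {
    local file="${1:-/sys/block/zram0/comp_algorithm}"
    selected_algorithm "$(cat "$file")"
}

# Example use on a live system (not executed here):
#   [ "$(zram_selected_algorithm)" = "deflate" ] && echo "deflate is selected"
```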

Two trivial workarounds:

-blacklist the thunderx_zip kmod:

# rmmod thunderx_zip
# echo 'blacklist thunderx_zip' >> /etc/modprobe.d/blacklist.conf

or

-disable the CRYPTO_DEV_CAVIUM_ZIP kconfig option (to avoid building the thunderx_zip kmod) and recompile
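
For the blacklist workaround, note that a plain `>>` appends a new line on every run. A minimal sketch of an idempotent variant follows; `blacklist_module` is a hypothetical helper name, and the configuration file path is passed in explicitly so it can be tried on a scratch file first.

```shell
#!/usr/bin/env bash
# Append "blacklist <module>" to a modprobe.d-style file unless an
# identical line is already present. (Hypothetical helper.)
blacklist_module() {
    local module="$1" conf="$2"
    if ! grep -qxF "blacklist $module" "$conf" 2>/dev/null; then
        printf 'blacklist %s\n' "$module" >> "$conf"
    fi
}

# Example use as root (not executed here):
#   blacklist_module thunderx_zip /etc/modprobe.d/blacklist.conf
```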

As to the cause: until yesterday the problem appeared to be connected to the arm64 KPTI patches, but now I'm sure it is not.

The problem first appeared in 4.13.0-37-generic #42, while 4.13.0-36-generic #40 was immune, and the only difference between those two kernels is the arm64 KPTI patchset.
You can easily reproduce it in the upstream stable/linux-4.14.y tree too (4.14.26 is affected, for example).

$ make defconfig
$ echo "CONFIG_CRYPTO_DEV_CAVIUM_ZIP=m" >> .config
$ make oldconfig

Build and install as usual, and then try the reproducer above.

But what I found this morning is that even the original 4.14.0 release is affected, and that release clearly doesn't contain the KPTI patches.

What I want to try now:

1) test it on different hardware (one thing I noticed is that the error changes slightly depending on whether the thunderx_zip kmod is loaded at boot or later in the board's life cycle, and that smells a lot like memory corruption)
2) test it with 4.15.x and 4.16

I'll write another update when I have more data.

Paolo Pisati (p-pisati) wrote :

OK, I tested it on another board (seuss) with 4.14.27, 4.15.10 and 4.16-rc4:

4.14.0: affected
4.14.27: affected
4.15.10: affected
4.16-rc4: doesn't boot at all, so I can't tell

So we can rule out bad hardware, and we can say for sure that it is not related to KPTI at all.

Another thing: when building a kernel from upstream, you also need to enable ZSMALLOC and ZRAM.
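
A quick way to confirm that an upstream .config has everything the reproducer needs is to scan it for the three options mentioned in this thread. This is a sketch only; `missing_config_options` is a hypothetical helper, and it treats an option as enabled when it is set to `y` or `m`.

```shell
#!/usr/bin/env bash
# Print each requested option that is not set to y or m in the given
# kernel .config file. Prints nothing if all options are enabled.
# (Hypothetical helper.)
missing_config_options() {
    local config="$1"; shift
    local opt
    for opt in "$@"; do
        grep -Eq "^${opt}=(y|m)$" "$config" || echo "$opt"
    done
}

# Example use in a kernel tree (not executed here):
#   missing_config_options .config CONFIG_ZSMALLOC CONFIG_ZRAM CONFIG_CRYPTO_DEV_CAVIUM_ZIP
```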

Jan Glauber (jan-glauber-i) wrote :

I can reproduce this issue with 4.13.0-37-generic. The debugfs info for the zip driver shows that 2 requests are pending. Either the requests are not submitted properly (Comp Req Submitted and Decomp Req Submitted are both 0) or they are not completed (the DOORBELL registers are 0).

root@crb1s:/sys/kernel/debug/thunderx_zip# cat zip_stats
        ZIP Device 0 Stats
-----------------------------------
Comp Req Submitted : 0
Comp Req Completed : 0
Compress In Bytes : 0
Compressed Out Bytes : 0
Average Chunk size : 0
Average Compression ratio : 0
Decomp Req Submitted : 0
Decomp Req Completed : 0
Decompress In Bytes : 0
Decompressed Out Bytes : 0
Decompress Bad requests : 0
Pending Req : 2
---------------------------------
root@crb1s:/sys/kernel/debug/thunderx_zip# cat zip_regs
--------------------------------
     ZIP Device 0 Registers
--------------------------------
ZIP_CMD_CTL : 0x0000000000000002
ZIP_THROTTLE : 0x0000000000000010
ZIP_CONSTANTS : 0x02017c0020060000
ZIP_QUE0_MAP : 0x0000000000000003
ZIP_QUE1_MAP : 0x0000000000000003
ZIP_QUE_ENA : 0x0000000000000003
ZIP_QUE_PRI : 0x0000000000000003
ZIP_QUE0_DONE : 0x0000000000000000
ZIP_QUE1_DONE : 0x0000000000000000
ZIP_QUE0_DOORBELL : 0x0000000000000000
ZIP_QUE1_DOORBELL : 0x0000000000000000
ZIP_QUE0_SBUF_ADDR : 0x00000000fa040100
ZIP_QUE1_SBUF_ADDR : 0x00000000fa042000
ZIP_QUE0_SBUF_CTL : 0x000003f100000000
ZIP_QUE1_SBUF_CTL : 0x000003f100000000
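
To check the same counters without reading the whole dump, a single field can be extracted from `zip_stats`. The following is a minimal sketch; `zip_stat` is a hypothetical helper, and it assumes the "Name : value" layout shown above.

```shell
#!/usr/bin/env bash
# Read a thunderx_zip "zip_stats" dump on stdin and print the value of
# the field whose name matches the first argument exactly.
# (Hypothetical helper; assumes the "Name : value" layout shown above.)
zip_stat() {
    awk -F':' -v key="$1" '
        {
            name = $1; value = $2
            gsub(/^[ \t]+|[ \t]+$/, "", name)    # trim the field name
            gsub(/^[ \t]+|[ \t]+$/, "", value)   # trim the value
            if (name == key) print value
        }'
}

# Example use as root (not executed here):
#   zip_stat 'Pending Req' < /sys/kernel/debug/thunderx_zip/zip_stats
```

A non-zero "Pending Req" alongside zero submitted and completed counts matches the state described in this comment.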

Jan Glauber (jan-glauber-i) wrote :

This is a regression caused by CONFIG_VMAP_STACK. The driver uses __pa (virt_to_phys) on a stack address, which does not work with virtually mapped stacks.

Solving this upstream will require a bit more work, so I'm attaching a minimal patch that works around the problem by using kmalloc for the problematic allocation.
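
The traces in this report show the difference between the two kinds of address. As a rough illustration only, the sketch below classifies a kernel virtual address as inside or outside the linear map. It assumes the arm64 48-bit VA layout of these 4.13/4.14 kernels, where the linear map starts at 0xffff800000000000; that constant and the function name `in_linear_map` are assumptions for illustration, not something taken from the patch.

```shell
#!/usr/bin/env bash
# Report whether a kernel virtual address lies in the linear (1:1) map.
# ASSUMPTION: arm64, 48-bit VA, 4.13/4.14-era layout, where the linear
# map begins at 0xffff800000000000. (Hypothetical helper.)
in_linear_map() {
    local addr
    # Normalize to 16 lowercase hex digits without a 0x prefix.
    addr="$(printf '%16s' "${1#0x}" | tr ' ' '0' | tr 'A-F' 'a-f')"
    case "$addr" in
        ffff[89abcdef]*) echo yes ;;   # >= 0xffff800000000000
        *)               echo no  ;;
    esac
}

# Addresses taken from the first trace in this report:
#   in_linear_map ffff801f57a86900   # "task:" pointer
#   in_linear_map ffff000023ec0000   # "task.stack:" pointer
```

With CONFIG_VMAP_STACK the task stack comes from the vmalloc area, so the second address falls outside the range in which __pa() is valid.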

tags: added: patch
Paolo Pisati (p-pisati) on 2018-03-27
description: updated
Changed in linux (Ubuntu Bionic):
status: Confirmed → Fix Committed
Po-Hsu Lin (cypressyew) wrote :

Can be reproduced with X-hwe on ThunderX as well.

Changed in linux (Ubuntu Artful):
status: New → In Progress
assignee: nobody → Paolo Pisati (p-pisati)
Changed in linux (Ubuntu Artful):
status: In Progress → Fix Committed
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-artful' to 'verification-done-artful'. If the problem still exists, change the tag 'verification-needed-artful' to 'verification-failed-artful'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-artful
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.15.0-15.16

---------------
linux (4.15.0-15.16) bionic; urgency=medium

  * linux: 4.15.0-15.16 -proposed tracker (LP: #1761177)

  * FFe: Enable configuring resume offset via sysfs (LP: #1760106)
    - PM / hibernate: Make passing hibernate offsets more friendly

  * /dev/bcache/by-uuid links not created after reboot (LP: #1729145)
    - SAUCE: (no-up) bcache: decouple emitting a cached_dev CHANGE uevent

  * Ubuntu18.04:POWER9:DD2.2 - Unable to start a KVM guest with default machine
    type(pseries-bionic) complaining "KVM implementation does not support
    Transactional Memory, try cap-htm=off" (kvm) (LP: #1752026)
    - powerpc: Use feature bit for RTC presence rather than timebase presence
    - powerpc: Book E: Remove unused CPU_FTR_L2CSR bit
    - powerpc: Free up CPU feature bits on 64-bit machines
    - powerpc: Add CPU feature bits for TM bug workarounds on POWER9 v2.2
    - powerpc/powernv: Provide a way to force a core into SMT4 mode
    - KVM: PPC: Book3S HV: Work around transactional memory bugs in POWER9
    - KVM: PPC: Book3S HV: Work around XER[SO] bug in fake suspend mode
    - KVM: PPC: Book3S HV: Work around TEXASR bug in fake suspend state

  * Important Kernel fixes to be backported for Power9 (kvm) (LP: #1758910)
    - powerpc/mm: Fixup tlbie vs store ordering issue on POWER9

  * Ubuntu 18.04 - IO Hang on some namespaces when running HTX with 16
    namespaces (Bolt / NVMe) (LP: #1757497)
    - powerpc/64s: Fix lost pending interrupt due to race causing lost update to
      irq_happened

  * fwts-efi-runtime-dkms 18.03.00-0ubuntu1: fwts-efi-runtime-dkms kernel module
    failed to build (LP: #1760876)
    - [Packaging] include the retpoline extractor in the headers

linux (4.15.0-14.15) bionic; urgency=medium

  * linux: 4.15.0-14.15 -proposed tracker (LP: #1760678)

  * [Bionic] mlx4 ETH - mlnx_qos failed when set some TC to vendor
    (LP: #1758662)
    - net/mlx4_en: Change default QoS settings

  * AT_BASE_PLATFORM in AUXV is absent on kernels available on Ubuntu 17.10
    (LP: #1759312)
    - powerpc/64s: Fix NULL AT_BASE_PLATFORM when using DT CPU features

  * Bionic update to 4.15.15 stable release (LP: #1760585)
    - net: dsa: Fix dsa_is_user_port() test inversion
    - openvswitch: meter: fix the incorrect calculation of max delta_t
    - qed: Fix MPA unalign flow in case header is split across two packets.
    - tcp: purge write queue upon aborting the connection
    - qed: Fix non TCP packets should be dropped on iWARP ll2 connection
    - sysfs: symlink: export sysfs_create_link_nowarn()
    - net: phy: relax error checking when creating sysfs link netdev->phydev
    - devlink: Remove redundant free on error path
    - macvlan: filter out unsupported feature flags
    - net: ipv6: keep sk status consistent after datagram connect failure
    - ipv6: old_dport should be a __be16 in __ip6_datagram_connect()
    - ipv6: sr: fix NULL pointer dereference when setting encap source address
    - ipv6: sr: fix scheduling in RCU when creating seg6 lwtunnel state
    - mlxsw: spectrum_buffers: Set a minimum quota for CPU port traffic
    - net: phy: Tell caller result ...

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
Manoj Iyer (manjo) wrote :

-- before proposed --
Ubuntu 17.10 (GNU/Linux 4.13.0-38-generic aarch64)
ubuntu@anuchin:~$ sudo bash
root@anuchin:~# modprobe zram
root@anuchin:~# echo 1 > /sys/block/zram0/reset
root@anuchin:~# echo deflate > /sys/block/zram0/comp_algorithm
root@anuchin:~# echo 128M > /sys/block/zram0/disksize
root@anuchin:~# mkfs.ext4 -F /dev/zram0
mke2fs 1.43.5 (04-Aug-2017)
Discarding device blocks: done
Creating filesystem with 32768 4k blocks and 32768 inodes

Allocating group tables: done
Writing inode tables: done
Creating journal (4096 blocks): done
Writing superblocks and filesystem accounting information:

manjo@lazy:~$ ipmitool -H 10.229.48.14 -I lanplus -U admin -P admin sol activate
[SOL Session operational. Use ~? for help]
[ 1420.012055] watchdog: BUG: soft lockup - CPU#14 stuck for 22s! [mkfs.ext4:3121]
[ 1424.023938] watchdog: BUG: soft lockup - CPU#40 stuck for 23s! [kworker/u96:2:406]
[ 1454.099071] INFO: rcu_sched self-detected stall on CPU
[ 1454.104260] 14-...: (14998 ticks this GP) idle=796/140000000000001/0 softirq=4896/4896 fqs=7500
[ 1454.107070] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 1454.107076] 14-...: (14998 ticks this GP) idle=796/140000000000001/0 softirq=4896/4896 fqs=7500
[ 1454.107077] (detected by 40, t=15002 jiffies, g=3929, c=3928, q=4597)
[ 1454.134272] (t=15008 jiffies g=3929 c=3928 q=4597)

-- after proposed --
ubuntu@anuchin:~$ uname -a
Linux anuchin 4.13.0-39-generic #44-Ubuntu SMP Thu Apr 5 14:20:13 UTC 2018 aarch64 aarch64 aarch64 GNU/Linux
ubuntu@anuchin:~$ sudo bash
root@anuchin:~# modprobe zram
root@anuchin:~# echo 1 > /sys/block/zram0/reset
root@anuchin:~# echo deflate > /sys/block/zram0/comp_algorithm
root@anuchin:~# echo 128M > /sys/block/zram0/disksize
root@anuchin:~# mkfs.ext4 -F /dev/zram0
mke2fs 1.43.5 (04-Aug-2017)
Discarding device blocks: done
Creating filesystem with 32768 4k blocks and 32768 inodes

Allocating group tables: done
Writing inode tables: done
Creating journal (4096 blocks): done
Writing superblocks and filesystem accounting information: done

root@anuchin:~#

tags: added: verification-done-artful
removed: verification-needed-artful
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.13.0-39.44

---------------
linux (4.13.0-39.44) artful; urgency=medium

  * linux: 4.13.0-39.44 -proposed tracker (LP: #1761456)

  * intel-microcode 3.20180312.0 causes lockup at login screen(w/ linux-
    image-4.13.0-37-generic) (LP: #1759920) // CVE-2017-5715 (Spectre v2
    Intel) // CVE-2017-5754
    - x86/mm: Reinitialize TLB state on hotplug and resume

  * intel-microcode 3.20180312.0 causes lockup at login screen(w/ linux-
    image-4.13.0-37-generic) (LP: #1759920) // CVE-2017-5715 (Spectre v2 Intel)
    - Revert "x86/mm: Only set IBPB when the new thread cannot ptrace current
      thread"
    - x86/speculation: Use Indirect Branch Prediction Barrier in context switch

  * DKMS driver builds fail with: Cannot use CONFIG_STACK_VALIDATION=y, please
    install libelf-dev, libelf-devel or elfutils-libelf-devel (LP: #1760876)
    - [Packaging] include the retpoline extractor in the headers

  * retpoline hints: primary infrastructure and initial hints (LP: #1758856)
    - [Packaging] retpoline-extract: flag *0xNNN(%reg) branches
    - x86/speculation, objtool: Annotate indirect calls/jumps for objtool
    - x86/speculation, objtool: Annotate indirect calls/jumps for objtool on 32bit
    - x86/paravirt, objtool: Annotate indirect calls
    - [Packaging] retpoline -- add safe usage hint support
    - [Packaging] retpoline-check -- only report additions
    - [Packaging] retpoline -- widen indirect call/jmp detection
    - [Packaging] retpoline -- elide %rip relative indirections
    - [Packaging] retpoline -- clear hint information from packages
    - KVM: x86: Make indirect calls in emulator speculation safe
    - KVM: VMX: Make indirect call speculation safe
    - x86/boot, objtool: Annotate indirect jump in secondary_startup_64()
    - SAUCE: early/late -- annotate indirect calls in early/late initialisation
      code
    - SAUCE: vga_set_mode -- avoid jump tables
    - [Config] retpoline -- switch to new format
    - [Packaging] retpoline hints -- handle missing files when RETPOLINE not
      enabled
    - [Packaging] final-checks -- remove check for empty retpoline files

  * retpoline: ignore %cs:0xNNN constant indirections (LP: #1752655)
    - [Packaging] retpoline -- elide %cs:0xNNNN constants on i386

  * zfs system process hung on container stop/delete (LP: #1754584)
    - SAUCE: Fix non-prefaulted page deadlock (LP: #1754584)

  * zfs-linux 0.6.5.11-1ubuntu5 ADT test failure with linux 4.15.0-1.2
    (LP: #1737761)
    - SAUCE: (noup) Update zfs to 0.6.5.11-1ubuntu3.2

  * AT_BASE_PLATFORM in AUXV is absent on kernels available on Ubuntu 17.10
    (LP: #1759312)
    - powerpc/64s: Fix NULL AT_BASE_PLATFORM when using DT CPU features

  * btrfs and tar sparse truncate archives (LP: #1757565)
    - Btrfs: move definition of the function btrfs_find_new_delalloc_bytes
    - Btrfs: fix reported number of inode blocks after buffered append writes

  * efifb broken on ThunderX-based Gigabyte nodes (LP: #1758375)
    - drivers/fbdev/efifb: Allow BAR to be moved instead of claiming it

  * Intel i40e PF reset due to incorrect MDD detection (continues...)
    (LP: #1723127)
    - i40e/i40ev...


Changed in linux (Ubuntu Artful):
status: Fix Committed → Fix Released
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-bionic
Andy Whitcroft (apw) on 2019-02-14
tags: added: kernel-fixup-verification-needed-bionic
removed: verification-needed-bionic
Andy Whitcroft (apw) wrote :

This bug was erroneously marked for verification in bionic; verification is not required and verification-needed-bionic is being removed.

tags: added: verification-done-bionic