arm64: Unfair rwlock can stall the system

Bug Description

There is a long-standing upstream bug in the ARM64-specific implementation of RW locks. The implementation is unfair and can starve writers under lock contention, leading to RCU stalls, driver timeouts, and general system instability.

[Test Case]
$ stress-ng --kill 0 -t 300 -v

You'll see the console fill with messages like:

[ 2534.423119] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 2534.428606] 192-...: (1 ticks this GP) idle=b6e/140000000000000/0 softirq=578/578 fqs=6770
[ 2534.437029] (detected by 0, t=15005 jiffies, g=1479, c=1478, q=473)
[ 2714.623691] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 2714.629181] 192-...: (1 ticks this GP) idle=b6e/140000000000000/0 softirq=578/578 fqs=12819
[ 2714.637692] (detected by 116, t=60058 jiffies, g=1479, c=1478, q=1736)
[ 2747.216955] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [kworker/0:5:1464]
[ 2775.399061] watchdog: BUG: soft lockup - CPU#13 stuck for 123s! [systemd-network:2936]
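
To make the pass/fail check less dependent on watching the console, the stall and lockup messages can be counted from the kernel log. This is only a rough sketch: the helper name count_stall_messages is made up for illustration, and it matches just the two message types shown above, so other failure symptoms (for example the driver timeouts) would not be counted.

```shell
# Hypothetical helper (not part of stress-ng or any kernel tooling):
# reads kernel log text on stdin and prints how many lines report an
# rcu_sched stall or a soft lockup, as in the messages above.
count_stall_messages() {
    # grep -c prints the number of matching lines; it exits non-zero
    # when that number is 0, so "|| true" keeps the helper's status clean.
    grep -c -E 'rcu_sched detected stalls|BUG: soft lockup' || true
}

# Example use after the stress-ng run (reading dmesg may require root):
#   dmesg | count_stall_messages
```

A result of 0 after the test case completes suggests the symptoms did not reproduce during that run.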

[Regression Risk]
The proposed fix comprises clean cherry-picks from the v4.15 merge window. The code modified for this fix is restricted to x86 & arm64, as they are the only Ubuntu architectures that define ARCH_USE_QUEUED_LOCKS. The fix was verified on a 224-CPU arm64 (ThunderX2) server and regression tested on a 128-CPU x86 system using stress-ng and locktorture.

dann frazier (dannf) on 2017-11-14
dann frazier (dannf) wrote :

This has been resolved with the following commits upstream:

commit d133166146333e1f13fc81c0e6c43c8d99290a8a
Author: Will Deacon <email address hidden>

    locking/qrwlock: Prevent slowpath writers getting held up by fastpath

commit 087133ac90763cd339b6b67f2998f87dcc136c52
Author: Will Deacon <email address hidden>

    locking/qrwlock, arm64: Move rwlock implementation over to qrwlocks

commit b519b56e378ee82caf9b079b04f5db87dedc3251
Author: Will Deacon <email address hidden>

    locking/qrwlock: Use atomic_cond_read_acquire() when spinning in qrwlock

commit 4df714be4dcf40bfb0d4af0f851a6e1977afa02e
Author: Will Deacon <email address hidden>

    locking/atomic: Add atomic_cond_read_acquire()

commit e0d02285f16e8d5810f3d5d5e8a5886ca0015d3b
Author: Will Deacon <email address hidden>

    locking/qrwlock: Use 'struct qrwlock' instead of 'struct __qrwlock'

dann frazier (dannf) on 2018-01-04
dann frazier (dannf) wrote :

Example showing driver timeouts:

ubuntu@boomer:~$ stress-ng --kill 0 -t 300 -v
stress-ng: debug: [3344] 224 processors online, 224 processors configured
stress-ng: info: [3344] dispatching hogs: 224 kill
stress-ng: debug: [3344] /sys/devices/system/cpu/cpu0/cache does not exist
stress-ng: info: [3344] cache allocate: using built-in defaults as unable to determine cache details
stress-ng: info: [3344] cache allocate: default cache size: 2048K
stress-ng: debug: [3344] starting stressors
stress-ng: debug: [3345] stress-ng-kill: started [3345] (instance 0)
stress-ng: debug: [3346] stress-ng-kill: started [3346] (instance 1)
stress-ng: debug: [3347] stress-ng-kill: started [3347] (instance 2)
stress-ng: debug: [3348] stress-ng-kill: started [3348] (instance 3)
stress-ng: debug: [3349] stress-ng-kill: started [3349] (instance 4)
[ 1447.474535] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 1447.480020] 27-...: (66 GPs behind) idle=1ba/140000000000000/0 softirq=3878/3878 fqs=7264
[ 1447.488363] 136-...: (93 GPs behind) idle=972/140000000000000/0 softirq=2760/2760 fqs=7265
[ 1447.496788] (detected by 161, t=15007 jiffies, g=1128, c=1127, q=790)
[ 1451.646152] xhci_hcd 0000:01:04.1: xHCI host controller not responding, assume dead
[ 1451.653819] xhci_hcd 0000:01:04.1: HC died; cleaning up
[ 1451.653829] usb 3-1-port1: cannot reset (err = -22)
[ 1451.653832] usb 3-1-port1: cannot reset (err = -22)
[ 1451.653833] usb 3-1-port1: cannot reset (err = -22)
[ 1451.653834] usb 3-1-port1: cannot reset (err = -22)
[ 1451.653835] usb 3-1-port1: cannot reset (err = -22)
[ 1451.653837] usb 3-1-port1: Cannot enable. Maybe the USB cable is bad?
[ 1451.653839] usb 3-1-port1: cannot disable (err = -22)
[ 1451.653848] usb 3-1-port2: cannot reset (err = -22)
[ 1451.653851] usb 3-1-port2: cannot reset (err = -22)
[ 1451.653852] usb 3-1-port2: cannot reset (err = -22)
[ 1451.653854] usb 3-1-port2: cannot reset (err = -22)
[ 1451.653855] usb 3-1-port2: cannot reset (err = -22)
[ 1451.653856] usb 3-1-port2: Cannot enable. Maybe the USB cable is bad?
[ 1451.653858] usb 3-1-port2: cannot disable (err = -22)
[ 1451.653860] usb 3-1-port1: cannot reset (err = -22)
[ 1451.653861] usb 3-1-port1: cannot reset (err = -22)
[ 1451.653862] usb 3-1-port1: cannot reset (err = -22)
[ 1451.653863] usb 3-1-port1: cannot reset (err = -22)
[ 1451.653864] usb 3-1-port1: cannot reset (err = -22)
[ 1451.653865] usb 3-1-port1: Cannot enable. Maybe the USB cable is bad?
[ 1451.653866] usb 3-1-port1: cannot disable (err = -22)
[ 1451.653868] usb 3-1-port2: cannot reset (err = -22)
[ 1451.653870] usb 3-1-port2: cannot reset (err = -22)
[ 1451.653871] usb 3-1-port2: cannot reset (err = -22)
[ 1451.653872] usb 3-1-port2: cannot reset (err = -22)
[ 1451.653873] usb 3-1-port2: cannot reset (err = -22)
[ 1451.653873] usb 3-1-port2: Cannot enable. Maybe the USB cable is bad?
[ 1451.653875] usb 3-1-port2: cannot disable (err = -22)
[ 1451.653876] usb 3-1-port1: cannot reset (err = -22)
[ 1451.653878] usb 3-1-port1: cannot reset (err = -22)
[ 1451.653879] usb 3-1-port1: cannot reset (err = -22)
[ 1451.653880] usb 3-1-port1: cannot reset (err = -22)
[ 1451.653881] usb 3-1-port1: cannot reset (err ...


dann frazier (dannf) on 2018-01-05
Seth Forshee (sforshee) on 2018-01-10
dann frazier (dannf) wrote :

Artful verification: I was able to successfully run the above stress-ng command without any errors on the console.

