Comment 4 for bug 1881109

bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2020-06-17 06:59 EDT-------
O.K. we have some new insights here.

@<email address hidden> did some experiments on my behalf with a slightly modified Ubuntu kernel (based on 5.4.0-29) from which I had removed commit 3060781f2664 ("s390/qdio: allow to scan all Output SBALs in one go") (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3060781f2664d34af641247aeac62696405a3fde). We suspected that this commit might be related to the queue-stalls/-slowdowns we always saw in the past before the crash in the WBT code. To my slight surprise, the queue-stalls/-slowdowns did disappear, but the WBT crash still persisted. I checked all available logs and our driver traces from the dump and didn't find any indication whatsoever that scsi-EH was ever invoked, nor that we went through adapter-recovery at any time after the initial instance boot: no command timeouts or anything.

So in my mind, while I can't prove yet that 3060781f2664 was really responsible for the queue-stalls/-slowdowns (that might still just be coincidence, although it *did* happen quite persistently before, and now not once, so that is rather suspicious to me), it shows that the crash in the WBT code is independent. So that seems to be something that can happen without any transport interruptions.

Here is the backtrace from that particular run, where no queue-stalls/-slowdowns were seen, but WBT still crashed:

[22808.815235] Unable to handle kernel pointer dereference in virtual kernel address space
[22808.815247] Failing address: 00007fe010a50000 TEID: 00007fe010a50403
[22808.815249] Fault in home space mode while using kernel ASCE.
[22808.815252] AS:000003dc9b67c00b R2:000003fd0000800b R3:000003fd0000c007 S:000003fba9b84800 P:0000000000000400
[22808.815368] Oops: 0011 ilc:2 [#1] SMP
[22808.815376] Modules linked in: xfs vhost_net vhost macvtap macvlan tap xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink ip6table_filter ip6_tables iptable_filter bpfilter bridge dm_service_time aufs overlay dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua s390_trng chsc_sch eadm_sch vfio_ccw vfio_mdev mdev vfio_iommu_type1 vfio 8021q garp mrp stp llc sch_fq_codel drm drm_panel_orientation_quirks i2c_core ip_tables x_tables btrfs zstd_compress zlib_deflate raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 linear dm_mirror dm_region_hash dm_log qeth_l2 pkey zcrypt crc32_vx_s390 ghash_s390 prng aes_s390 des_s390 libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common zfcp scsi_transport_fc dasd_eckd_mod dasd_mod qeth qdio ccwgroup
[22808.815519] CPU: 14 PID: 185372 Comm: CPU 0/KVM Kdump: loaded Not tainted 5.4.0-2901-generic #01
[22808.815521] Hardware name: IBM 8561 T01 708 (LPAR)
[22808.815523] Krnl PSW : 0404e00180000000 000003dc9a6dd9be (try_to_wake_up+0x4e/0x700)
[22808.815535] R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
[22808.815602] Krnl GPRS: 000003fbd9ab7588 00007fe000000000 00007fe00000000f 0000000000000003
[22808.815605] 0000000000000000 0000000000000039 04007fe001ef7a88 0000000000000003
[22808.815607] 0000000000000003 00007fe010a50284 0000000000000000 00007fe010a4f930
[22808.815609] 000003f588fd6600 000003dc9af0f070 00007fe001ef7ae0 00007fe001ef7a60
[22808.815620] Krnl Code: 000003dc9a6dd9b2: 41902954 la %r9,2388(%r2)
000003dc9a6dd9b6: 582003ac l %r2,940
#000003dc9a6dd9ba: a7180000 lhi %r1,0
>000003dc9a6dd9be: ba129000 cs %r1,%r2,0(%r9)
000003dc9a6dd9c2: a77401c9 brc 7,000003dc9a6ddd54
000003dc9a6dd9c6: e310b0080004 lg %r1,8(%r11)
000003dc9a6dd9cc: b9800018 ngr %r1,%r8
000003dc9a6dd9d0: a774001f brc 7,000003dc9a6dda0e
[22808.815637] Call Trace:
[22808.816011] ([<00007fff809861e8>] __key.84156+0x10/0xfffffffffffb7e28 [xfs])
[22808.816022] [<000003dc9ab596ba>] rq_qos_wake_function+0x8a/0xa0
[22808.816025] [<000003dc9a6fcbde>] __wake_up_common+0x9e/0x1b0
[22808.816028] [<000003dc9a6fd0e4>] __wake_up_common_lock+0x94/0xe0
[22808.816029] [<000003dc9a6fd15a>] __wake_up+0x2a/0x40
[22808.816034] [<000003dc9ab70640>] wbt_done+0x90/0xe0
[22808.816036] [<000003dc9ab597be>] __rq_qos_done+0x3e/0x60
[22808.816040] [<000003dc9ab455b0>] blk_mq_free_request+0xe0/0x140
[22808.816045] [<000003dc9ace7c60>] dm_softirq_done+0x140/0x230
[22808.816046] [<000003dc9ab43fbc>] blk_done_softirq+0xbc/0xe0
[22808.816051] [<000003dc9af06710>] __do_softirq+0x100/0x360
[22808.816054] [<000003dc9a6ad25e>] irq_exit+0x9e/0xc0
[22808.816057] [<000003dc9a638b18>] do_IRQ+0x78/0xb0
[22808.816059] [<000003dc9af05c28>] ext_int_handler+0x128/0x12c
[22808.816060] [<000003dc9af05306>] sie_exit+0x0/0x46
[22808.816065] ([<000003dc9a67144a>] __vcpu_run+0x27a/0xc30)
[22808.816068] [<000003dc9a67a9a8>] kvm_arch_vcpu_ioctl_run+0x2d8/0x840
[22808.816072] [<000003dc9a665242>] kvm_vcpu_ioctl+0x282/0x770
[22808.816077] [<000003dc9a90df66>] do_vfs_ioctl+0x376/0x690
[22808.816078] [<000003dc9a90e304>] ksys_ioctl+0x84/0xb0
[22808.816080] [<000003dc9a90e39a>] __s390x_sys_ioctl+0x2a/0x40
[22808.816082] [<000003dc9af055f2>] system_call+0x2a6/0x2c8
[22808.816084] Last Breaking-Event-Address:
[22808.816087] [<000003dc9a6de07e>] wake_up_process+0xe/0x20
[22808.816159] Kernel panic - not syncing: Fatal exception in interrupt
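
As a reading aid for the register dump above, the following sketch checks two pieces of arithmetic using only values printed in the oops. It assumes the usual s390 layout in which "Krnl GPRS" lists r0 to r15, four registers per line, so that r9 is 00007fe010a50284 and r11 is 00007fe010a4f930. That r11 still holds the pointer that was in %r2 when "la %r9,2388(%r2)" executed is an inference from the numbers matching, not something the trace states.

```python
PAGE_SIZE = 4096  # 4 KiB pages

# Values copied verbatim from the oops (assumed register order: r0..r15).
failing_address = 0x00007fe010a50000  # "Failing address" line
r9 = 0x00007fe010a50284               # base register of "cs %r1,%r2,0(%r9)"
r11 = 0x00007fe010a4f930
la_displacement = 2388                # from "la %r9,2388(%r2)"


def page_base(addr, page_size=PAGE_SIZE):
    """Return the start address of the page that contains addr."""
    return addr & ~(page_size - 1)


# The faulting compare-and-swap accesses 0(%r9); its page should be
# the page reported as the failing address.
print(page_base(r9) == failing_address)

# Undo the "la" displacement to recover the pointer it was computed from,
# and compare that with r11.
print(hex(r9 - la_displacement))
print(r9 - la_displacement == r11)
```

If both comparisons hold, the fault is on the memory that try_to_wake_up() addresses at displacement 2388 from the pointer it was handed, which would fit a bad pointer being passed in from rq_qos_wake_function. This is only a consistency check on the dump, not a root-cause analysis.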

For the moment I'll stop chasing the WBT crash. I already provided a work-around for it with the udev-rule I wrote before (although, I might note, I have no idea what this workaround means for performance; WBT is primarily a performance feature that is intended to stop I/O starvation in the face of excessive page-cache writeback from individual tasks). We will definitely continue working on the queue-stalls/-slowdowns, but that now seems independent of this particular bug-report.
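
For readers who want to see what turning WBT off looks like at the sysfs level, here is a minimal sketch. It is not the udev-rule referred to in this comment (that rule is not reproduced here). It relies only on the documented block-queue attribute queue/wbt_lat_usec, where writing 0 disables writeback throttling for that device. The function name disable_wbt and the sysfs_root parameter are hypothetical, introduced so the code can be tried against a scratch directory instead of the real /sys.

```python
import sys
from pathlib import Path


def disable_wbt(device, sysfs_root="/sys"):
    """Write 0 to <sysfs_root>/block/<device>/queue/wbt_lat_usec.

    Returns the value the attribute held before, so it can be restored.
    """
    attr = Path(sysfs_root) / "block" / device / "queue" / "wbt_lat_usec"
    previous = attr.read_text().strip()
    attr.write_text("0\n")  # 0 disables writeback throttling
    return previous


if __name__ == "__main__":
    # Usage (as root): python3 disable_wbt.py dm-0 sda ...
    for dev in sys.argv[1:]:
        old = disable_wbt(dev)
        print(f"{dev}: wbt_lat_usec {old} -> 0")
```

A setting made this way does not survive a reboot or a device re-add, which is presumably why a udev-rule is the more practical vehicle for the work-around.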