Kernel panic on `5.4.0-1033-gke` (Kernel panic - not syncing: Aiee, killing interrupt handler!) possibly iscsi related

Bug #1921825 reported by Khaled El Mously
This bug affects 1 person
Affects              Status         Importance   Assigned to        Milestone
linux-gke (Ubuntu)   New            Undecided    Khaled El Mously
  Focal              Fix Released   Undecided    Khaled El Mously

Bug Description

[Impact]
Kernel panic during high iscsi activity

The following stack trace is logged:

[ 223.386958] BUG: scheduling while atomic: iscsiadm/18136/0x00000200
[ 223.393390] Modules linked in: tcp_diag inet_diag xt_nat ipt_REJECT nf_reject_ipv4 xt_tcpudp ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs sch_htb ebt_ip ebtable_filter ebtables veth xt_mark br_netfilter iptable_mangle xt_MASQUERADE xt_comment xt_addrtype iptable_nat binfmt_misc iptable_filter bpfilter xt_conntrack nf_nat bridge stp llc xfrm_user xfrm_algo aufs overlay nls_iso8859_1 dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper input_leds serio_raw sch_fq_codel sunrpc ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi virtio_rng ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 multipath linear psmouse virtio_net net_failover failover
[ 223.393429] CPU: 6 PID: 18136 Comm: iscsiadm Kdump: loaded Not tainted 5.4.0-1033-gke #35~18.04.1-Ubuntu
[ 223.393430] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
[ 223.393430] Call Trace:
[ 223.393439] dump_stack+0x6d/0x95
[ 223.393464] __schedule_bug+0x55/0x70
[ 223.393467] __schedule+0x61b/0x710
[ 223.393469] schedule+0x33/0xa0
[ 223.393472] __lock_sock+0x7d/0xc0
[ 223.393475] ? wait_woken+0x80/0x80
[ 223.393477] lock_sock_nested+0x64/0x70
[ 223.393479] inet_getname+0xaa/0xe0
[ 223.393482] kernel_getpeername+0x1b/0x20
[ 223.393485] iscsi_sw_tcp_conn_get_param+0xa6/0x110 [iscsi_tcp]
[ 223.393494] show_conn_ep_param_ISCSI_PARAM_CONN_ADDRESS+0x7e/0xa0 [scsi_transport_iscsi]
[ 223.393496] dev_attr_show+0x1d/0x50
[ 223.393499] sysfs_kf_seq_show+0xa1/0x110
[ 223.393502] kernfs_seq_show+0x27/0x30
[ 223.393504] seq_read+0xda/0x420
[ 223.393506] kernfs_fop_read+0x141/0x1a0
[ 223.393510] __vfs_read+0x1b/0x40
[ 223.393512] vfs_read+0x8e/0x130
[ 223.393513] ksys_read+0xa7/0xe0
[ 223.393515] __x64_sys_read+0x1a/0x20
[ 223.393518] do_syscall_64+0x57/0x190
[ 223.393521] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 223.393523] RIP: 0033:0x7f45793ce910
[ 223.393525] Code: b6 fe ff ff 48 8d 3d 0f be 08 00 48 83 ec 08 e8 06 db 01 00 66 0f 1f 44 00 00 83 3d f9 2d 2c 00 00 75 10 b8 00 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 de 9b 01 00 48 89 04 24
[ 223.393526] RSP: 002b:00007ffd9fa13688 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 223.393527] RAX: ffffffffffffffda RBX: 00007ffd9fa13820 RCX: 00007f45793ce910
[ 223.393528] RDX: 0000000000000100 RSI: 00007ffd9fa13720 RDI: 0000000000000003
[ 223.393528] RBP: 00007ffd9fa13720 R08: 0000000000000000 R09: 0000000000000000
[ 223.393529] R10: 0000000000000064 R11: 0000000000000246 R12: 0000000000000003
[ 223.393530] R13: 00007ffd9fa13c60 R14: 0000555b0d613708 R15: 0000555b0d613300
[ 223.393581] sd 1:0:0:0: [sdb] Write Protect is off
[ 223.393583] sd 1:0:0:0: [sdb] Mode Sense: 43 00 10 08
[ 223.393660] iscsiadm[18136]: segfault at 7ffd9fa12e58 ip 0000555b0ccd95af sp 00007ffd9fa12e60 error 6 in iscsiadm[555b0ccb4000+58000]
[ 223.393666] Code: ba 00 02 00 00 48 81 ec 10 04 00 00 48 89 e7 48 8d 9c 24 00 02 00 00 64 48 8b 04 25 28 00 00 00 48 89 84 24 08 04 00 00 31 c0 <e8> 3c ed 00 00 ba 00 02 00 00 4c 89 ee 48 89 e7 e8 6c ed 00 00 ba
[ 223.394992] sd 1:0:0:0: alua: transition timeout set to 60 seconds
[ 223.394997] sd 1:0:0:0: alua: port group 02 state N non-preferred supports TOlUSNA
[ 223.395018] sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, supports DPO and FUA
[ 223.395435] Kernel panic - not syncing: Aiee, killing interrupt handler!
[ 223.396802] sd 1:0:0:0: [sdb] Optimal transfer size 262144 bytes
[ 223.402387] CPU: 6 PID: 18136 Comm: iscsiadm Kdump: loaded Tainted: G W 5.4.0-1033-gke #35~18.04.1-Ubuntu
[ 223.402388] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
[ 223.402389] Call Trace:
[ 223.402395] dump_stack+0x6d/0x95
[ 223.402398] panic+0xfe/0x2e4
[ 223.402400] do_exit+0x899/0xb90
[ 223.402402] do_group_exit+0x43/0xa0
[ 223.402406] get_signal+0x14f/0x860
[ 223.402409] do_signal+0x34/0x6d0
[ 223.402414] ? __bad_area_nosemaphore+0x149/0x1f0
[ 223.457597] exit_to_usermode_loop+0x8e/0x100
[ 223.462090] prepare_exit_to_usermode+0x91/0xa0
[ 223.466782] retint_user+0x8/0x8
[ 223.470131] RIP: 0033:0x555b0ccd95af
[ 223.473827] Code: ba 00 02 00 00 48 81 ec 10 04 00 00 48 89 e7 48 8d 9c 24 00 02 00 00 64 48 8b 04 25 28 00 00 00 48 89 84 24 08 04 00 00 31 c0 <e8> 3c ed 00 00 ba 00 02 00 00 4c 89 ee 48 89 e7 e8 6c ed 00 00 ba
[ 223.492991] RSP: 002b:00007ffd9fa12e60 EFLAGS: 00010246
[ 223.498333] RAX: 0000000000000000 RBX: 00007ffd9fa13060 RCX: 00007f45793ce335
[ 223.505579] RDX: 0000000000000200 RSI: 0000555b0cf146a0 RDI: 00007ffd9fa12e60
[ 223.513025] RBP: 00007ffd9fa13ecc R08: 0000000000000000 R09: 0000000080808000
[ 223.520268] R10: 0000000000000075 R11: 0000000000000246 R12: 00007ffd9fa13540
[ 223.527521] R13: 00007ffd9fa13540 R14: 0000000000000200 R15: 0000555b0d613300

The panic occurs during high iSCSI activity.
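The trailing number in the first line of the trace ("scheduling while atomic: iscsiadm/18136/0x00000200") is the task's preempt_count. As a rough aid to reading it, here is a minimal C sketch that splits the value into its fields. It assumes the field layout from include/linux/preempt.h in 5.4 (8 preempt bits, 8 softirq bits, 4 hardirq bits, 1 NMI bit); the struct and function names are made up for this example and are not kernel APIs.

```c
#include <stdio.h>

/* Field positions assumed from include/linux/preempt.h (Linux 5.4). */
#define PREEMPT_SHIFT 0
#define SOFTIRQ_SHIFT 8
#define HARDIRQ_SHIFT 16
#define NMI_SHIFT     20

/* Hypothetical helper type for this example only. */
struct preempt_fields {
    unsigned int preempt_depth;    /* preempt_disable() nesting (8 bits) */
    unsigned int serving_softirq;  /* 1 if a softirq handler is running */
    unsigned int bh_disable_depth; /* local_bh_disable() nesting */
    unsigned int hardirq_depth;    /* hard IRQ nesting (4 bits) */
    unsigned int in_nmi;           /* NMI flag (1 bit) */
};

static struct preempt_fields decode_preempt_count(unsigned int count)
{
    struct preempt_fields f;
    unsigned int softirq = (count >> SOFTIRQ_SHIFT) & 0xffu;

    f.preempt_depth = (count >> PREEMPT_SHIFT) & 0xffu;
    /* The lowest softirq bit marks "serving a softirq"; disabling bottom
     * halves adds two units per nesting level, so the rest is a depth. */
    f.serving_softirq = softirq & 1u;
    f.bh_disable_depth = softirq >> 1;
    f.hardirq_depth = (count >> HARDIRQ_SHIFT) & 0xfu;
    f.in_nmi = (count >> NMI_SHIFT) & 0x1u;
    return f;
}

static void print_preempt_fields(unsigned int count)
{
    struct preempt_fields f = decode_preempt_count(count);

    printf("preempt=%u serving_softirq=%u bh_disable=%u hardirq=%u nmi=%u\n",
           f.preempt_depth, f.serving_softirq, f.bh_disable_depth,
           f.hardirq_depth, f.in_nmi);
}
```

Calling print_preempt_fields(0x00000200) reports one level of bottom-half disabling and nothing else. That would be consistent with the read being done while a lock taken with spin_lock_bh() is held, which matches the title of the fix, though the trace alone does not name the lock.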

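This issue has also been seen on linux-5.8; see, for example, https://lkml.org/lkml/2020/7/28/1085. It affects the gcp-5.4 kernel specifically because gcp-5.4 has backported commit 1b66d253610c7 ("bpf: Add get{peer, sock}name attach types for sock_addr"), which introduces the issue. In the trace above, inet_getname() takes the socket lock through lock_sock_nested(), which may sleep, while iscsi_sw_tcp_conn_get_param() calls it from atomic context; hence the "scheduling while atomic" report.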

[Fix]

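The fix is the upstream commit "scsi: iscsi: iscsi_tcp: Avoid holding spinlock while calling getpeername()", merged in 5.9: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=bcf3a2953d36bbfb9bd44ccb3db0897d935cc485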
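To illustrate the general idea behind "avoid holding a lock while calling getpeername()", here is a userspace analogue in C. It is a sketch of the pattern only, not the kernel patch: a pthread mutex stands in for the spinlock, and fake_sock, fake_conn, conn_get_address and demo_getname are all hypothetical names invented for this example. The object is pinned with a reference while the lock is held, the lock is dropped, and only then is the call that may block made.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for the socket and the iSCSI connection. */
struct fake_sock { int refcount; const char *peer; };
struct fake_conn { pthread_mutex_t lock; struct fake_sock *sock; };

/* A lookup that may block; stands in for the getpeername() call. */
typedef int (*getname_fn)(struct fake_conn *conn, struct fake_sock *sk,
                          char *buf, size_t len);

static int conn_get_address(struct fake_conn *conn, getname_fn getname,
                            char *buf, size_t len)
{
    struct fake_sock *sk;
    int rc;

    pthread_mutex_lock(&conn->lock);
    sk = conn->sock;
    if (!sk) {
        pthread_mutex_unlock(&conn->lock);
        return -1;                      /* not connected */
    }
    sk->refcount++;                     /* pin the object while locked */
    pthread_mutex_unlock(&conn->lock);

    rc = getname(conn, sk, buf, len);   /* may block: lock is not held */

    pthread_mutex_lock(&conn->lock);
    sk->refcount--;                     /* drop the pin */
    pthread_mutex_unlock(&conn->lock);
    return rc;
}

/* Demo lookup: records whether the connection lock was free when called. */
static int lock_was_free;

static int demo_getname(struct fake_conn *conn, struct fake_sock *sk,
                        char *buf, size_t len)
{
    if (pthread_mutex_trylock(&conn->lock) == 0) {
        lock_was_free = 1;
        pthread_mutex_unlock(&conn->lock);
    } else {
        lock_was_free = 0;
    }
    snprintf(buf, len, "%s", sk->peer);
    return 0;
}
```

In this sketch the reference count is what keeps the object valid once the lock has been released; without it, a concurrent teardown could free the object during the blocking call.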

[Test]
The affected customer has reported that they can no longer reproduce the problem with this fix applied. Without it, they could readily reproduce the crash.

[Regression potential]
I'm not aware of any. The patch seems reasonable. It is accepted in mainline and backported to the stable kernels too. It is present in groovy 5.8 as of https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1898853

no longer affects: kernel-sru-workflow
description: updated
Changed in linux-gcp (Ubuntu Focal):
assignee: nobody → Khaled El Mously (kmously)
no longer affects: linux-gcp (Ubuntu)
no longer affects: linux-gcp (Ubuntu Focal)
Changed in linux-gke (Ubuntu):
assignee: nobody → Khaled El Mously (kmously)
Changed in linux-gke (Ubuntu Focal):
assignee: nobody → Khaled El Mously (kmously)
description: updated
Changed in linux-gke (Ubuntu Focal):
status: New → Fix Committed
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-focal' to 'verification-done-focal'. If the problem still exists, change the tag 'verification-needed-focal' to 'verification-failed-focal'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-focal
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-gke - 5.4.0-1041.43

---------------
linux-gke (5.4.0-1041.43) focal; urgency=medium

  * focal/linux-gke: 5.4.0-1041.43 -proposed tracker (LP: #1922201)

  * Kernel panic on `5.4.0-1033-gke` (Kernel panic - not syncing: Aiee, killing
    interrupt handler!) possibly iscsi related (LP: #1921825)
    - scsi: iscsi: iscsi_tcp: Avoid holding spinlock while calling getpeername()

linux-gke (5.4.0-1040.42) focal; urgency=medium

  * focal/linux-gke: 5.4.0-1040.42 -proposed tracker (LP: #1921027)

  * Enforce CONFIG_DRM_BOCHS=m (LP: #1916290)
    - [Config] [gke] updateconfigs for CONFIG_DRM_BOCHS

  [ Ubuntu: 5.4.0-71.79 ]

  * focal/linux: 5.4.0-71.79 -proposed tracker (LP: #1921040)
  * selftests: bpf verifier fails after sanitize_ptr_alu fixes (LP: #1920995)
    - bpf: Simplify alu_limit masking for pointer arithmetic
    - bpf: Add sanity check for upper ptr_limit
    - bpf, selftests: Fix up some test_verifier cases for unprivileged
  * Packaging resync (LP: #1786013)
    - update dkms package versions
  * Fix missing HDMI/DP audio on NVidia card after S3 (LP: #1918228)
    - ALSA: hda/hdmi: Reduce hda_jack_tbl lookup at unsol event handling
    - ALSA: hda/hdmi: Don't use standard hda_jack for generic HDMI jacks
    - ALSA: hda/hdmi: Move runtime PM resume into hdmi_present_sense_via_verbs()
    - ALSA: hda/hdmi: Move ELD parse and jack reporting into update_eld()
  * Focal update: v5.4.101 upstream stable release (LP: #1918170)
    - HID: make arrays usage and value to be the same
    - USB: quirks: sort quirk entries
    - usb: quirks: add quirk to start video capture on ELMO L-12F document camera
      reliable
    - ntfs: check for valid standard information attribute
    - arm64: tegra: Add power-domain for Tegra210 HDA
    - scripts: use pkg-config to locate libcrypto
    - scripts: set proper OpenSSL include dir also for sign-file
    - mm: unexport follow_pte_pmd
    - mm: simplify follow_pte{,pmd}
    - KVM: do not assume PTE is writable after follow_pfn
    - mm: provide a saner PTE walking API for modules
    - KVM: Use kvm_pfn_t for local PFN variable in hva_to_pfn_remapped()
    - NET: usb: qmi_wwan: Adding support for Cinterion MV31
    - cxgb4: Add new T6 PCI device id 0x6092
    - cifs: Set CIFS_MOUNT_USE_PREFIX_PATH flag on setting cifs_sb->prepath.
    - scripts/recordmcount.pl: support big endian for ARCH sh
    - Linux 5.4.101
  * Focal update: v5.4.100 upstream stable release (LP: #1918168)
    - KVM: SEV: fix double locking due to incorrect backport
    - net: qrtr: Fix port ID for control messages
    - net: bridge: Fix a warning when del bridge sysfs
    - Xen/x86: don't bail early from clear_foreign_p2m_mapping()
    - Xen/x86: also check kernel mapping in set_foreign_p2m_mapping()
    - Xen/gntdev: correct dev_bus_addr handling in gntdev_map_grant_pages()
    - Xen/gntdev: correct error checking in gntdev_map_grant_pages()
    - xen/arm: don't ignore return errors from set_phys_to_machine
    - xen-blkback: don't "handle" error by BUG()
    - xen-netback: don't "handle" error by BUG()
    - xen-scsiback: don't "handle" error by BUG()
    - xen-blkback: fix error handling in ...

Changed in linux-gke (Ubuntu Focal):
status: Fix Committed → Fix Released
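Since the changelog above names linux-gke 5.4.0-1041.43 as the first package carrying the fix, a quick way to tell whether a node is still exposed is to compare the ABI number in its kernel release string (the format seen in the trace, "5.4.0-1033-gke") against 1041. The C sketch below does that under stated assumptions: it only understands 5.4.0-N-gke release strings, it assumes ABI numbers increase monotonically within that series, and the function name and constant are made up for this example.

```c
#include <stdio.h>
#include <string.h>

/* First fixed ABI, taken from the changelog entry linux-gke 5.4.0-1041.43. */
#define FIXED_GKE_ABI 1041

/*
 * Returns 1 if the release string is a 5.4.0-N-gke kernel with N at or past
 * the fixed ABI, 0 if it is an older 5.4.0-N-gke kernel, and -1 if the
 * string is not in that form (other flavours and series are not handled).
 */
static int gke_release_has_fix(const char *release)
{
    int abi = 0;
    char flavour[16] = "";

    if (sscanf(release, "5.4.0-%d-%15s", &abi, flavour) != 2)
        return -1;
    if (strcmp(flavour, "gke") != 0)
        return -1;
    return abi >= FIXED_GKE_ABI;
}
```

On a live system the release string could be taken from the release field filled in by uname(2), or from the output of `uname -r`.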