Ubuntu 18.04- call trace in kernel buffer when unloading ib_ipoib module

Bug #1904848 reported by Amir Tzin on 2020-11-19
This bug affects 1 person
Affects          Status        Importance  Assigned to
linux (Ubuntu)   Invalid       Undecided   Unassigned
Bionic           Fix Released  Medium      Ian

Bug Description

[Impact]
Unloading ib_ipoib causes a call trace to be logged in the kernel ring buffer.

Bisecting the bionic kernel shows that the issue was exposed by
616e695435e3 workqueue: Try to catch flush_work() without INIT_WORK()
in version 4.15.0-59.66.

[Test Case]

# modprobe ib_ipoib
# modprobe ib_ipoib -r
# dmesg
[ 306.277717] ------------[ cut here ]------------
[ 306.277738] WARNING: CPU: 10 PID: 2148 at /build/linux-RJNBJC/linux-4.15.0/kernel/workqueue.c:2906 __flush_work+0x1f8/0x210
[ 306.277739] Modules linked in: nfsv3 nfs fscache xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack libcrc32c ipt_REJECT nf_reject_ipv4 xt_tcpudp ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter bridge stp llc binfmt_misc intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp rpcrdma rdma_ucm ib_umad ib_uverbs coretemp ib_iser rdma_cm kvm_intel kvm iw_cm irqbypass ib_ipoib(-) libiscsi scsi_transport_iscsi ib_cm joydev input_leds crct10dif_pclmul crc32_pclmul mgag200 ttm drm_kms_helper drm hpilo ghash_clmulni_intel pcbc i2c_algo_bit ipmi_ssif fb_sys_fops syscopyarea sysfillrect sysimgblt aesni_intel aes_x86_64 crypto_simd ioatdma glue_helper shpchp cryptd dca intel_cstate intel_rapl_perf
[ 306.277790] serio_raw acpi_power_meter lpc_ich mac_hid ipmi_si ipmi_devintf ipmi_msghandler nfsd auth_rpcgss nfs_acl lockd grace sunrpc sch_fq_codel ip_tables x_tables autofs4 mlx5_ib mlx4_ib mlx4_en ib_core hid_generic psmouse mlx5_core usbhid hid pata_acpi hpsa tg3 mlxfw mlx4_core scsi_transport_sas ptp pps_core devlink
[ 306.277817] CPU: 10 PID: 2148 Comm: modprobe Not tainted 4.15.0-124-generic #127-Ubuntu
[ 306.277818] Hardware name: HP ProLiant DL380p Gen8, BIOS P70 07/01/2015
[ 306.277823] RIP: 0010:__flush_work+0x1f8/0x210
[ 306.277825] RSP: 0018:ffffbdeb47ecfcd8 EFLAGS: 00010286
[ 306.277827] RAX: 0000000000000024 RBX: ffff993a5c3d8ec8 RCX: 0000000000000006
[ 306.277829] RDX: 0000000000000000 RSI: ffff99429ef16498 RDI: ffff99429ef16490
[ 306.277830] RBP: ffffbdeb47ecfd48 R08: 000000000000050d R09: 0000000000000004
[ 306.277832] R10: ffffe263a058c1c0 R11: 0000000000000001 R12: ffff993a5c3d8ec8
[ 306.277833] R13: 0000000000000001 R14: ffffbdeb47ecfd78 R15: ffffffffb00a9800
[ 306.277835] FS: 00007fa1124a9540(0000) GS:ffff99429ef00000(0000) knlGS:0000000000000000
[ 306.277837] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 306.277839] CR2: 000055b1c5007bb0 CR3: 0000000fcf36c002 CR4: 00000000001606e0
[ 306.277840] Call Trace:
[ 306.277850] __cancel_work_timer+0x136/0x1b0
[ 306.277881] ? mlx5_core_destroy_qp+0x99/0xd0 [mlx5_core]
[ 306.277886] cancel_delayed_work_sync+0x13/0x20
[ 306.277909] mlx5e_detach_netdev+0x83/0x90 [mlx5_core]
[ 306.277931] mlx5_rdma_netdev_free+0x30/0x80 [mlx5_core]
[ 306.277941] mlx5_ib_free_rdma_netdev+0xe/0x10 [mlx5_ib]
[ 306.277948] ipoib_remove_one+0xe4/0x180 [ib_ipoib]
[ 306.277965] ib_unregister_client+0x171/0x1e0 [ib_core]
[ 306.277972] ipoib_cleanup_module+0x15/0x2f [ib_ipoib]
[ 306.277978] SyS_delete_module+0x1ab/0x2d0
[ 306.277983] do_syscall_64+0x73/0x130
[ 306.277989] entry_SYSCALL_64_after_hwframe+0x41/0xa6
[ 306.277992] RIP: 0033:0x7fa111fc1047
[ 306.277993] RSP: 002b:00007ffc0db32298 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[ 306.277996] RAX: ffffffffffffffda RBX: 00005614be46cca0 RCX: 00007fa111fc1047
[ 306.277997] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 00005614be46cd08
[ 306.277999] RBP: 00005614be46cca0 R08: 00007ffc0db31241 R09: 0000000000000000
[ 306.278000] R10: 00007fa11203dc40 R11: 0000000000000206 R12: 00005614be46cd08
[ 306.278002] R13: 0000000000000001 R14: 00005614be46cd08 R15: 00007ffc0db33680
[ 306.278004] Code: 24 03 80 c9 f0 e9 5b ff ff ff 48 c7 c7 18 50 0b b1 e8 ed 66 04 00 0f 0b 31 c0 e9 75 ff ff ff 48 c7 c7 18 50 0b b1 e8 d8 66 04 00 <0f> 0b 31 c0 e9 60 ff ff ff e8 5a 35 fe ff 66 2e 0f 1f 84 00 00
[ 306.278035] ---[ end trace 652f7759937172a2 ]---
[ 306.646061] ------------[ cut here ]------------
[ 306.646077] WARNING: CPU: 6 PID: 2148 at /build/linux-RJNBJC/linux-4.15.0/kernel/workqueue.c:2906 __flush_work+0x1f8/0x210
[ 306.646078] Modules linked in: nfsv3 nfs fscache xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack libcrc32c ipt_REJECT nf_reject_ipv4 xt_tcpudp ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter bridge stp llc binfmt_misc intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp rpcrdma rdma_ucm ib_umad ib_uverbs coretemp ib_iser rdma_cm kvm_intel kvm iw_cm irqbypass ib_ipoib(-) libiscsi scsi_transport_iscsi ib_cm joydev input_leds crct10dif_pclmul crc32_pclmul mgag200 ttm drm_kms_helper drm hpilo ghash_clmulni_intel pcbc i2c_algo_bit ipmi_ssif fb_sys_fops syscopyarea sysfillrect sysimgblt aesni_intel aes_x86_64 crypto_simd ioatdma glue_helper shpchp cryptd dca intel_cstate intel_rapl_perf
[ 306.646123] serio_raw acpi_power_meter lpc_ich mac_hid ipmi_si ipmi_devintf ipmi_msghandler nfsd auth_rpcgss nfs_acl lockd grace sunrpc sch_fq_codel ip_tables x_tables autofs4 mlx5_ib mlx4_ib mlx4_en ib_core hid_generic psmouse mlx5_core usbhid hid pata_acpi hpsa tg3 mlxfw mlx4_core scsi_transport_sas ptp pps_core devlink
[ 306.646146] CPU: 6 PID: 2148 Comm: modprobe Tainted: G W 4.15.0-124-generic #127-Ubuntu
[ 306.646148] Hardware name: HP ProLiant DL380p Gen8, BIOS P70 07/01/2015
[ 306.646152] RIP: 0010:__flush_work+0x1f8/0x210
[ 306.646154] RSP: 0018:ffffbdeb47ecfcd8 EFLAGS: 00010286
[ 306.646156] RAX: 0000000000000024 RBX: ffff9942970b8ec8 RCX: 0000000000000006
[ 306.646158] RDX: 0000000000000000 RSI: ffff99429ee16498 RDI: ffff99429ee16490
[ 306.646159] RBP: ffffbdeb47ecfd48 R08: 0000000000000533 R09: 0000000000000004
[ 306.646161] R10: ffffe2639fa66740 R11: 0000000000000001 R12: ffff9942970b8ec8
[ 306.646162] R13: 0000000000000001 R14: ffffbdeb47ecfd78 R15: ffffffffb00a9800
[ 306.646164] FS: 00007fa1124a9540(0000) GS:ffff99429ee00000(0000) knlGS:0000000000000000
[ 306.646166] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 306.646167] CR2: 000055dd889e4a30 CR3: 0000000fcf36c006 CR4: 00000000001606e0
[ 306.646169] Call Trace:
[ 306.646177] __cancel_work_timer+0x136/0x1b0
[ 306.646205] ? mlx5_core_destroy_qp+0x99/0xd0 [mlx5_core]
[ 306.646210] cancel_delayed_work_sync+0x13/0x20
[ 306.646233] mlx5e_detach_netdev+0x83/0x90 [mlx5_core]
[ 306.646255] mlx5_rdma_netdev_free+0x30/0x80 [mlx5_core]
[ 306.646264] mlx5_ib_free_rdma_netdev+0xe/0x10 [mlx5_ib]
[ 306.646271] ipoib_remove_one+0xe4/0x180 [ib_ipoib]
[ 306.646287] ib_unregister_client+0x171/0x1e0 [ib_core]
[ 306.646295] ipoib_cleanup_module+0x15/0x2f [ib_ipoib]
[ 306.646300] SyS_delete_module+0x1ab/0x2d0
[ 306.646305] do_syscall_64+0x73/0x130
[ 306.646310] entry_SYSCALL_64_after_hwframe+0x41/0xa6
[ 306.646313] RIP: 0033:0x7fa111fc1047
[ 306.646314] RSP: 002b:00007ffc0db32298 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[ 306.646317] RAX: ffffffffffffffda RBX: 00005614be46cca0 RCX: 00007fa111fc1047
[ 306.646318] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 00005614be46cd08
[ 306.646319] RBP: 00005614be46cca0 R08: 00007ffc0db31241 R09: 0000000000000000
[ 306.646321] R10: 00007fa11203dc40 R11: 0000000000000206 R12: 00005614be46cd08
[ 306.646322] R13: 0000000000000001 R14: 00005614be46cd08 R15: 00007ffc0db33680
[ 306.646325] Code: 24 03 80 c9 f0 e9 5b ff ff ff 48 c7 c7 18 50 0b b1 e8 ed 66 04 00 0f 0b 31 c0 e9 75 ff ff ff 48 c7 c7 18 50 0b b1 e8 d8 66 04 00 <0f> 0b 31 c0 e9 60 ff ff ff e8 5a 35 fe ff 66 2e 0f 1f 84 00 00
[ 306.646355] ---[ end trace 652f7759937172a3 ]---
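
The check for the warning can be scripted. Below is a minimal sketch, assuming the warning always appears as a "WARNING: ... __flush_work" line as in the log above; the function name count_flush_work_warnings is hypothetical and not part of any Ubuntu tooling.

```shell
#!/bin/sh
# Hypothetical helper: count kernel WARNING lines raised from __flush_work
# in dmesg text read from stdin. Prints the count (0 if there are none).
count_flush_work_warnings() {
    # grep -c exits non-zero when nothing matches but still prints 0,
    # so "|| true" keeps the function's exit status at success.
    grep -c 'WARNING: .*__flush_work' || true
}

# Intended use, as root, on an affected machine:
#   modprobe ib_ipoib
#   modprobe -r ib_ipoib
#   dmesg | count_flush_work_warnings
```

Only the WARNING lines are counted, not the RIP lines that also mention __flush_work, so the log above, with its two traces, would give a count of 2. A count of 0 after the reload is what a fixed kernel is expected to show.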

[Fix]
The root cause of this error is that a delayed work item belonging to the IPoIB net devices is cancelled without ever having been initialized; the solution is to make sure it is initialized.
The fix is implemented in the attached one-line patch.
Please note that this patch is not upstream; it is based on the following upstream commits, which introduced similar functionality in v4.20-rc1:

303211b44ce3 net/mlx5e: Always initialize update stats delayed work
182570b26223 net/mlx5e: Gather common netdev init/cleanup functionality in one place

Applying these two commits cleanly to the bionic tree requires additional patches that might introduce a large change, so I think it is better (if possible) to use the attached patch.

[Regression Potential]
Regression risk is low, since this is a small fix and similar functionality was accepted upstream in v4.20.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1904848

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: bionic
Amir Tzin (amirtz) on 2020-11-19
description: updated
tags: added: patch
Jeff Lane (bladernr) on 2020-11-19
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Stefan Bader (smb) on 2020-11-26
Changed in linux (Ubuntu Bionic):
assignee: nobody → Kamal Mostafa (kamalmostafa)
importance: Undecided → Medium
status: New → In Progress
Changed in linux (Ubuntu):
status: Confirmed → Invalid
Changed in linux (Ubuntu Bionic):
assignee: Kamal Mostafa (kamalmostafa) → Ian (ian-may)
Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Committed

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-bionic

Amir, would you be able to verify whether the fix commit resolves this bug?

Amir Tzin (amirtz) wrote :

William,
Yes, I built a bionic kernel with this commit included, and the bug was resolved.

Amir Tzin (amirtz) wrote :

I also tested with kernel Ubuntu-4.15.0-129.132, downloaded from http://archive.ubuntu.com/ubuntu/pool/main/l/linux/,
which includes the commit, and the issue is resolved.
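
For anyone repeating this verification, a quick way to tell whether a machine is already on a kernel at or after 4.15.0-129 is to compare the ABI number in the release string. This is only a sketch: the function names kernel_abi and has_ipoib_fix are hypothetical, it assumes the generic bionic 4.15.0 series (other flavours use different ABI numbering), and it assumes later kernels in that series keep the fix.

```shell
#!/bin/sh
# Hypothetical helper: print the ABI number from a release string,
# e.g. 4.15.0-129-generic -> 129.
kernel_abi() {
    rel=${1#*-}                        # drop everything up to the first '-'
    printf '%s\n' "${rel%%[!0-9]*}"    # keep only the leading digits
}

# Hypothetical helper: succeed if the given 4.15.0-series release is at or
# after ABI 129 (the version this bug was verified against).
# Returns 2 for releases outside the 4.15.0 series, where this check
# does not apply.
has_ipoib_fix() {
    case $1 in
        4.15.0-[0-9]*) ;;
        *) return 2 ;;
    esac
    [ "$(kernel_abi "$1")" -ge 129 ]
}

# Intended use:
#   has_ipoib_fix "$(uname -r)" && echo "kernel should include the fix"
```

The kernel in the original report (4.15.0-124-generic) fails this check, while 4.15.0-129-generic passes.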

Terry Rudd (terrykrudd) wrote :

Thanks Amir, I believe you were on vacation, so I really appreciate it.

Updated tag to verification-done-bionic. Thanks so much, Amir!

tags: added: verification-done-bionic
removed: verification-needed-bionic
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.15.0-129.132

---------------
linux (4.15.0-129.132) bionic; urgency=medium

  * bionic/linux: 4.15.0-129.132 -proposed tracker (LP: #1907635)

  * Packaging resync (LP: #1786013)
    - update dkms package versions

  * Ubuntu 18.04- call trace in kernel buffer when unloading ib_ipoib module
    (LP: #1904848)
    - SAUCE: net/mlx5e: IPoIB, initialize update_stat_work for ipoib devices

  * memory is leaked when tasks are moved to net_prio (LP: #1886859)
    - netprio_cgroup: Fix unlimited memory leak of v2 cgroups

  * s390: dbginfo.sh triggers kernel panic, reading from
    /sys/kernel/mm/page_idle/bitmap (LP: #1904884)
    - mm/page_idle.c: skip offline pages

  * Bionic update: upstream stable patchset 2020-11-23 (LP: #1905333)
    - drm/i915: Break up error capture compression loops with cond_resched()
    - tipc: fix use-after-free in tipc_bcast_get_mode
    - gianfar: Replace skb_realloc_headroom with skb_cow_head for PTP
    - gianfar: Account for Tx PTP timestamp in the skb headroom
    - net: usb: qmi_wwan: add Telit LE910Cx 0x1230 composition
    - sctp: Fix COMM_LOST/CANT_STR_ASSOC err reporting on big-endian platforms
    - sfp: Fix error handing in sfp_probe()
    - Blktrace: bail out early if block debugfs is not configured
    - i40e: Fix of memory leak and integer truncation in i40e_virtchnl.c
    - Fonts: Replace discarded const qualifier
    - ALSA: usb-audio: Add implicit feedback quirk for Qu-16
    - lib/crc32test: remove extra local_irq_disable/enable
    - kthread_worker: prevent queuing delayed work from timer_fn when it is being
      canceled
    - mm: always have io_remap_pfn_range() set pgprot_decrypted()
    - gfs2: Wake up when sd_glock_disposal becomes zero
    - ftrace: Fix recursion check for NMI test
    - ftrace: Handle tracing when switching between context
    - tracing: Fix out of bounds write in get_trace_buf
    - futex: Handle transient "ownerless" rtmutex state correctly
    - ARM: dts: sun4i-a10: fix cpu_alert temperature
    - x86/kexec: Use up-to-dated screen_info copy to fill boot params
    - of: Fix reserved-memory overlap detection
    - blk-cgroup: Fix memleak on error path
    - blk-cgroup: Pre-allocate tree node on blkg_conf_prep
    - scsi: core: Don't start concurrent async scan on same host
    - vsock: use ns_capable_noaudit() on socket create
    - drm/vc4: drv: Add error handding for bind
    - ACPI: NFIT: Fix comparison to '-ENXIO'
    - vt: Disable KD_FONT_OP_COPY
    - fork: fix copy_process(CLONE_PARENT) race with the exiting ->real_parent
    - serial: 8250_mtk: Fix uart_get_baud_rate warning
    - serial: txx9: add missing platform_driver_unregister() on error in
      serial_txx9_init
    - USB: serial: cyberjack: fix write-URB completion race
    - USB: serial: option: add Quectel EC200T module support
    - USB: serial: option: add LE910Cx compositions 0x1203, 0x1230, 0x1231
    - USB: serial: option: add Telit FN980 composition 0x1055
    - USB: Add NO_LPM quirk for Kingston flash drive
    - usb: mtu3: fix panic in mtu3_gadget_stop()
    - ARC: stack unwinding: avoid indefinite looping
    - Revert "ARC: entry: fix potential EFA c...

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released