BUG: scheduling while atomic: containerd-shim

Bug #2045379 reported by Andrew Sergeev
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
ubuntu-realtime
Triaged
Medium
Unassigned

Bug Description

While running dockerized OAI 5G core. No other workload at the node.

$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
bd6158091b03 oaisoftwarealliance/oai-spgwu-tiny:develop "python3 /openair-sp…" 6 weeks ago Up 41 hours (healthy) 2152/udp, 8805/udp, 8080/tcp oai-spgwu
5a573e81812a oaisoftwarealliance/oai-smf:develop "/openair-smf/bin/oa…" 6 weeks ago Up 41 hours (healthy) 80/tcp, 8080/tcp, 8805/udp oai-smf
bf5e2ace6b33 oaisoftwarealliance/oai-amf:develop "/openair-amf/bin/oa…" 6 weeks ago Up 41 hours (healthy) 80/tcp, 8080/tcp, 9090/tcp, 38412/sctp oai-amf
a59daa7e2fd3 oaisoftwarealliance/oai-ausf:develop "/openair-ausf/bin/o…" 6 weeks ago Up 41 hours (healthy) 80/tcp, 8080/tcp oai-ausf
23cdbf34ff64 oaisoftwarealliance/oai-udm:develop "/openair-udm/bin/oa…" 6 weeks ago Up 41 hours (healthy) 80/tcp, 8080/tcp oai-udm
edd58729d58d oaisoftwarealliance/oai-udr:develop "/openair-udr/bin/oa…" 6 weeks ago Up 41 hours (healthy) 80/tcp, 8080/tcp oai-udr
d55de3495580 oaisoftwarealliance/trf-gen-cn5g:jammy "/bin/bash -c ' ip r…" 6 weeks ago Up 41 hours (healthy) oai-ext-dn
f7673911e537 mysql:8.0 "docker-entrypoint.s…" 6 weeks ago Up 41 hours (healthy) 3306/tcp, 33060/tcp mysql
6f82c485ec34 oaisoftwarealliance/ims:latest "asterisk -fp" 6 weeks ago Up 41 hours (healthy) asterisk-ims
6a02be4359ff oaisoftwarealliance/oai-nrf:develop "/openair-nrf/bin/oa…" 6 weeks ago Up 41 hours (healthy) 80/tcp, 8080/tcp, 9090/tcp oai-nrf

$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-5.15.0-1051-realtime root=/dev/mapper/ubuntu--vg-ubuntu--lv ro mitigations=off usbcore.autosuspend=-1 intel_iommu=on iommu=pt selinux=0 enforcing=0 nmi_watchdog=0 audit=0 skew_tick=1 nohz=on isolcpus=managed_irq,domain,1-12 nohz_full=1-12 rcu_nocbs=1-12 rcu_nocb_poll intel_pstate=disable nosoftlockup tsc=reliable hugepagesz=1G hugepages=10 default_hugepagesz=1G kthread_cpus=0 irqaffinity=0 megaraid_sas.msix_disable=1

$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 143
model name : Intel(R) Xeon(R) Gold 6442Y
stepping : 8
microcode : 0x2b0004d0
cpu MHz : 2601.000
cache size : 61440 KB
physical id : 0
siblings : 24
core id : 0
cpu cores : 24
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 32
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpcid_single cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities
vmx flags : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid ple shadow_vmcs pml ept_mode_based_exec tsc_scaling usr_wait_pause
bugs : spectre_v1 spectre_v2 spec_store_bypass swapgs eibrs_pbrsb
bogomips : 5200.00
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 57 bits virtual
power management:

[650264.311944] BUG: scheduling while atomic: containerd-shim/15151/0x00000002
[650264.311954] Modules linked in: iptable_nat bpfilter sctp ip6_udp_tunnel udp_tunnel macvlan xt_nat veth irdma i40e ice tls xt_tcpudp xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo nft_counter xt_addrtype nft_compat nf_tables nfnetlink br_netfilter bridge stp llc overlay rpcrdma sunrpc rdma_ucm ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm ib_cm iavf intel_rapl_msr intel_rapl_common i10nm_edac nfit binfmt_misc x86_pkg_temp_thermal intel_powerclamp coretemp nls_iso8859_1 kvm_intel kvm rapl wmi_bmof dell_smbios dcdbas dell_wmi_descriptor isst_if_mbox_pci ipmi_ssif pmt_telemetry pmt_crashlog pmt_class ib_uverbs idxd isst_if_mmio isst_if_common idxd_bus ib_core mei_me mei acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter mac_hid sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456
[650264.312056] async_raid6_recov async_memcpy async_pq async_xor async_tx xor cdc_ether usbnet mii raid6_pq libcrc32c raid1 raid0 multipath linear mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec rc_core crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd ahci i2c_i801 cryptd megaraid_sas drm intel_pmt xhci_pci tg3 libahci i2c_smbus i2c_ismt xhci_pci_renesas wmi pinctrl_emmitsburg [last unloaded: i40e]
[650264.312104] Preemption disabled at:
[650264.312105] [<fffffffface5198a>] fpu_clone+0xaa/0x380
[650264.312123] CPU: 18 PID: 15151 Comm: containerd-shim Tainted: G W 5.15.0-1051-realtime #57-Ubuntu
[650264.312130] Hardware name: Dell Inc. PowerEdge R660/0R5JJC, BIOS 1.6.6 09/20/2023
[650264.312133] Call Trace:
[650264.312136] <TASK>
[650264.312141] show_stack+0x52/0x5c
[650264.312151] dump_stack_lvl+0x4a/0x63
[650264.312159] ? fpu_clone+0xaa/0x380
[650264.312166] dump_stack+0x10/0x16
[650264.312172] __schedule_bug.cold+0x84/0x9a
[650264.312179] schedule_debug.constprop.0+0xce/0x110
[650264.312186] __schedule+0x5f/0x560
[650264.312194] ? task_blocks_on_rt_mutex.constprop.0.isra.0+0x1c8/0x470
[650264.312203] schedule_rtlock+0x22/0x50
[650264.312209] rtlock_slowlock_locked+0xfd/0x270
[650264.312218] rt_spin_lock+0x4d/0x70
[650264.312225] fpu_clone+0x32a/0x380
[650264.312232] copy_thread+0x100/0x270
[650264.312240] copy_process+0x9e3/0x1700
[650264.312247] kernel_clone+0x9d/0x440
[650264.312251] ? migrate_enable+0xdf/0x170
[650264.312259] ? preempt_count_sub+0xbe/0x100
[650264.312265] __do_sys_clone+0x5d/0x80
[650264.312270] __x64_sys_clone+0x25/0x30
[650264.312274] do_syscall_64+0x59/0xc0
[650264.312280] ? __this_cpu_preempt_check+0x13/0x20
[650264.312287] ? __context_tracking_enter+0x84/0x120
[650264.312294] ? syscall_exit_to_user_mode+0x2d/0x50
[650264.312301] ? do_syscall_64+0x69/0xc0
[650264.312305] ? __context_tracking_enter+0x84/0x120
[650264.312311] ? syscall_exit_to_user_mode+0x2d/0x50
[650264.312317] ? do_syscall_64+0x69/0xc0
[650264.312321] entry_SYSCALL_64_after_hwframe+0x62/0xcc
[650264.312326] RIP: 0033:0x47e7cf
[650264.312331] Code: 48 8b 74 24 18 48 c7 c2 00 00 00 00 49 c7 c2 00 00 00 00 49 c7 c0 00 00 00 00 49 c7 c1 00 00 00 00 48 8b 44 24 08 41 5c 0f 05 <41> 54 48 3d 01 f0 ff ff 76 12 48 c7 44 24 20 ff ff ff ff 48 f7 d8
[650264.312336] RSP: 002b:000000c0001e0c78 EFLAGS: 00000206 ORIG_RAX: 0000000000000038
[650264.312342] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 000000000047e7cf
[650264.312346] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000004111
[650264.312348] RBP: 000000c0001e0f28 R08: 0000000000000000 R09: 0000000000000000
[650264.312351] R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000475050
[650264.312353] R13: 000000000000002f R14: 000000c0000f0d00 R15: 000000c000080800
[650264.312362] </TASK>
[650264.312366] ------------[ cut here ]------------
[650264.312369] DEBUG_LOCKS_WARN_ON(val > preempt_count())
[650264.312375] WARNING: CPU: 18 PID: 15151 at kernel/sched/core.c:5573 preempt_count_sub+0x7f/0x100
[650264.312384] Modules linked in: iptable_nat bpfilter sctp ip6_udp_tunnel udp_tunnel macvlan xt_nat veth irdma i40e ice tls xt_tcpudp xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo nft_counter xt_addrtype nft_compat nf_tables nfnetlink br_netfilter bridge stp llc overlay rpcrdma sunrpc rdma_ucm ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm ib_cm iavf intel_rapl_msr intel_rapl_common i10nm_edac nfit binfmt_misc x86_pkg_temp_thermal intel_powerclamp coretemp nls_iso8859_1 kvm_intel kvm rapl wmi_bmof dell_smbios dcdbas dell_wmi_descriptor isst_if_mbox_pci ipmi_ssif pmt_telemetry pmt_crashlog pmt_class ib_uverbs idxd isst_if_mmio isst_if_common idxd_bus ib_core mei_me mei acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter mac_hid sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456
[650264.312453] async_raid6_recov async_memcpy async_pq async_xor async_tx xor cdc_ether usbnet mii raid6_pq libcrc32c raid1 raid0 multipath linear mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec rc_core crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd ahci i2c_i801 cryptd megaraid_sas drm intel_pmt xhci_pci tg3 libahci i2c_smbus i2c_ismt xhci_pci_renesas wmi pinctrl_emmitsburg [last unloaded: i40e]
[650264.312488] CPU: 18 PID: 15151 Comm: containerd-shim Tainted: G W 5.15.0-1051-realtime #57-Ubuntu
[650264.312492] Hardware name: Dell Inc. PowerEdge R660/0R5JJC, BIOS 1.6.6 09/20/2023
[650264.312494] RIP: 0010:preempt_count_sub+0x7f/0x100
[650264.312500] Code: cc cc cc cc e8 12 d2 55 00 85 c0 74 ee 8b 15 f8 c1 39 02 85 d2 75 e4 48 c7 c6 fa af 5d ae 48 c7 c7 34 c8 5b ae e8 d1 01 c4 00 <0f> 0b eb cd 4c 8b 65 08 4c 89 e7 e8 c1 00 04 00 85 c0 74 1f 48 8b
[650264.312503] RSP: 0018:ff284fe50216bc38 EFLAGS: 00010282
[650264.312507] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
[650264.312509] RDX: 0000000000000000 RSI: ffffffffacf520a4 RDI: ffffffffacf520a4
[650264.312512] RBP: ff284fe50216bc48 R08: ffffffffb028f4cb R09: 00000000b028f4b3
[650264.312514] R10: ff284fe50216bb18 R11: 0000000000000001 R12: ff1c974682268000
[650264.312516] R13: ff1c974682268000 R14: ff1c97483e021640 R15: ff1c9746aa9d4180
[650264.312519] FS: 000000c000080890(0000) GS:ff1c976580080000(0000) knlGS:0000000000000000
[650264.312523] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[650264.312526] CR2: 000000c00031c000 CR3: 0000000163920003 CR4: 0000000000771ee0
[650264.312529] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[650264.312531] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[650264.312534] PKRU: 55555554
[650264.312536] Call Trace:
[650264.312537] <TASK>
[650264.312539] ? show_trace_log_lvl+0x1d6/0x2ea
[650264.312546] ? show_trace_log_lvl+0x1d6/0x2ea
[650264.312555] ? fpu_clone+0xdd/0x380
[650264.312564] ? show_regs.part.0+0x23/0x29
[650264.312569] ? show_regs.cold+0x8/0xd
[650264.312575] ? preempt_count_sub+0x7f/0x100
[650264.312580] ? __warn+0x84/0xd0
[650264.312585] ? preempt_count_sub+0x7f/0x100
[650264.312590] ? report_bug+0xa4/0xd0
[650264.312599] ? handle_bug+0x39/0x90
[650264.312604] ? exc_invalid_op+0x19/0x70
[650264.312610] ? asm_exc_invalid_op+0x1b/0x20
[650264.312616] ? __wake_up_klogd.part.0+0x64/0xd0
[650264.312624] ? __wake_up_klogd.part.0+0x64/0xd0
[650264.312630] ? preempt_count_sub+0x7f/0x100
[650264.312635] ? preempt_count_sub+0x7f/0x100
[650264.312639] fpu_clone+0xdd/0x380
[650264.312647] copy_thread+0x100/0x270
[650264.312655] copy_process+0x9e3/0x1700
[650264.312663] kernel_clone+0x9d/0x440
[650264.312666] ? migrate_enable+0xdf/0x170
[650264.312674] ? preempt_count_sub+0xbe/0x100
[650264.312679] __do_sys_clone+0x5d/0x80
[650264.312686] __x64_sys_clone+0x25/0x30
[650264.312690] do_syscall_64+0x59/0xc0
[650264.312695] ? __this_cpu_preempt_check+0x13/0x20
[650264.312702] ? __context_tracking_enter+0x84/0x120
[650264.312709] ? syscall_exit_to_user_mode+0x2d/0x50
[650264.312715] ? do_syscall_64+0x69/0xc0
[650264.312719] ? __context_tracking_enter+0x84/0x120
[650264.312726] ? syscall_exit_to_user_mode+0x2d/0x50
[650264.312732] ? do_syscall_64+0x69/0xc0
[650264.312736] entry_SYSCALL_64_after_hwframe+0x62/0xcc
[650264.312740] RIP: 0033:0x47e7cf
[650264.312744] Code: 48 8b 74 24 18 48 c7 c2 00 00 00 00 49 c7 c2 00 00 00 00 49 c7 c0 00 00 00 00 49 c7 c1 00 00 00 00 48 8b 44 24 08 41 5c 0f 05 <41> 54 48 3d 01 f0 ff ff 76 12 48 c7 44 24 20 ff ff ff ff 48 f7 d8
[650264.312748] RSP: 002b:000000c0001e0c78 EFLAGS: 00000206 ORIG_RAX: 0000000000000038
[650264.312751] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 000000000047e7cf
[650264.312754] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000004111
[650264.312756] RBP: 000000c0001e0f28 R08: 0000000000000000 R09: 0000000000000000
[650264.312758] R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000475050
[650264.312760] R13: 000000000000002f R14: 000000c0000f0d00 R15: 000000c000080800
[650264.312767] </TASK>
[650264.312769] ---[ end trace 0000000000000003 ]---

description: updated
Changed in ubuntu-realtime:
importance: Undecided → Medium
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Hello Andrew,

Do you know if this issue is a regression? Has this issue happened with prior versions of the real-time kernel? If you don't know, would it be possible for you to test a preview kernel version, such as:
5.15.0-1049.55

Changed in ubuntu-realtime:
status: New → Triaged
Revision history for this message
Andrew Sergeev (asergeev-adva) wrote (last edit ):

Hi Joseph,

It happend also in 5.15.0-1046-realtime, 5.15.0-1050-realtime (5.15.0-1050.56), then I have upgraded to 5.15.0-1051.57.
Unfortunately I could not go below 5.15.0-1051.57 now.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

If I build some test kernels, would you be able to install and test them? I could provide you kernels as .deb packages. I'd like to build some newer upstream rt kernels to see this bug exists in newer patchsets.

Changed in ubuntu-realtime:
status: Triaged → In Progress
Revision history for this message
Andrew Sergeev (asergeev-adva) wrote (last edit ):

An update - because of linuxptp "clock jumped forward or running faster than expected" warning I had to upgrade the in-tree ICE driver to the latest one (1.12.7). The upgrade eliminated the warning and it makes more time to reporoduce the bug.
Yes, I would be able to test some new kernels.

Changed in ubuntu-realtime:
assignee: nobody → Joseph Salisbury (jsalisbury)
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Could you fist test the upstream 5.15 realtime kernel? This will tell us if the bug is upstream or specific to the Ubuntu kernel.

The upstream kernel can be downloaded from:
https://people.canonical.com/~jsalisbury/lp2045379/

To install the kernel, copy the .deb packages to your test system and run:
sudo dpkg -i *.deb

Then reboot and select the new kernel from the GRUB menu (It should also be the latest kernel installed by default). You can confirm you are running this kernel with 'uname -a'

Thanks in advance!

Revision history for this message
Andrew Sergeev (asergeev-adva) wrote :

After 24 hours with this kernel the problem did not happen.

$ uname -r
5.15.141-0515141-realtime

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Thanks for testing, Andrew! That seems to indicate this bug might already be fixed in a stable update between 5.15.125 through 5.15.141.

The Ubuntu 5.15.0-1052-58 kernel should be available in -updates now. Could you apply the latest updates and test that kernel to see if the fix is in this kernel?

If it is not, we can either "Reverse bisect" to identify the fix in upstream stable, or await until it is pulled down from upstream via the normal stable update process.

Revision history for this message
Andrew Sergeev (asergeev-adva) wrote :

5.15.0-1052-realtime does not fix the problem.
it happend after 3 hours

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Sorry for the delay, Andrew. I was on holiday for a couple of weeks.

I'll start a reverse bisect and post a test kernel shortly.

Changed in ubuntu-realtime:
assignee: Joseph Salisbury (jsalisbury) → nobody
status: In Progress → Triaged
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.