race_sched in ubuntu_stress_smoke_test will cause kernel panic on 6.8 with Azure Standard_A2_v2 instance

Bug #2068024 reported by Po-Hsu Lin
This bug affects 1 person
Affects              Status         Importance  Assigned to  Milestone
ubuntu-kernel-tests  New            Undecided   Unassigned
linux (Ubuntu)       Invalid        Undecided   Unassigned
  Noble              Fix Committed  High        John Cabaj

Bug Description

This issue can be found on:
  * N-Azure-6.8.0-1008.8
  * N-generic-6.8.0-35.35
  * J-Azure-6.8.0-1008.8~22.04.1

The issue reproduces 100% of the time on Azure Standard_A2_v2 instances. It can also be reproduced on Standard_D2pds_v5, but at a lower rate.

syslog output:
2024-06-04T12:21:29.655736+00:00 n-laz-az-6-8-stda2v2-u-stress-smk-test kernel: zswap: loaded using pool lzo/zbud
2024-06-04T12:21:29.727437+00:00 n-laz-az-6-8-stda2v2-u-stress-smk-test stress-ng: invoked with './stress-ng -v -t 5 --race-sched 4 --race-sched-ops 3000 --ignite-cpu --syslog --verbose --verify --oomable' by user 0 'root'
2024-06-04T12:21:29.727600+00:00 n-laz-az-6-8-stda2v2-u-stress-smk-test stress-ng: system: 'n-laz-az-6-8-stda2v2-u-stress-smk-test' Linux 6.8.0-1001-azure #1-Ubuntu SMP Tue Feb 13 17:53:47 UTC 2024 x86_64
2024-06-04T12:21:29.727683+00:00 n-laz-az-6-8-stda2v2-u-stress-smk-test stress-ng: memory (MB): total 3918.72, free 3424.57, shared 4.08, buffer 36.20, swap 0.00, free swap 0.00
2024-06-04T12:21:29.727723+00:00 n-laz-az-6-8-stda2v2-u-stress-smk-test stress-ng: stress-ng: info: [1250] setting to a 5 secs run per stressor
2024-06-04T12:21:29.805799+00:00 n-laz-az-6-8-stda2v2-u-stress-smk-test stress-ng: stress-ng: info: [1250] dispatching hogs: 4 race-sched

Console output:
[ 1167.163045] I/O error, dev loop0, sector 256 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[ 1435.517597] BUG: kernel NULL pointer dereference, address: 00000000000000a0
[ 1435.522651] #PF: supervisor read access in kernel mode
[ 1435.525407] #PF: error_code(0x0000) - not-present page
[ 1435.528122] PGD 0 P4D 0
[ 1435.529813] Oops: 0000 [#1] SMP PTI
[ 1435.531744] CPU: 0 PID: 121253 Comm: stress-ng-race- Tainted: P O 6.8.0-1008-azure #8-Ubuntu
[ 1435.536481] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090008 12/07/2018
[ 1435.543274] RIP: 0010:pick_next_task_fair+0x91/0x620
[ 1435.545480] Code: 91 00 00 00 49 81 bd b0 02 00 00 80 a8 89 92 75 60 4d 89 fe eb 27 4c 89 f7 e8 0b b7 ff ff 84 c0 75 3f 4c 89 f7 e8 5f 04 ff ff <4c> 8b b0 a0 00 00 00 48 89 c3 4d 85 f6 0f 84 f4 00 00 00 49 8b 46
[ 1435.554629] RSP: 0018:ffffb2b202e73cf8 EFLAGS: 00010096
[ 1435.558030] RAX: 0000000000000000 RBX: ffffb2b202e73dc8 RCX: fd78d84d198c4000
[ 1435.562226] RDX: 0000000000000c00 RSI: e411d03fda1d7382 RDI: 0000000000000c02
[ 1435.566496] RBP: ffffb2b202e73d38 R08: 0000000000000002 R09: 0000000000000002
[ 1435.570327] R10: 0000000000000000 R11: 0000000000000000 R12: ffff920dbbc33580
[ 1435.574620] R13: ffff920d05570000 R14: ffff920dbbc33680 R15: ffff920dbbc33680
[ 1435.579115] FS: 00007fb92ad12d00(0000) GS:ffff920dbbc00000(0000) knlGS:0000000000000000
[ 1435.583308] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1435.586094] CR2: 00000000000000a0 CR3: 0000000102364001 CR4: 00000000003706f0
[ 1435.590178] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1435.594054] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1435.597740] Call Trace:
[ 1435.599469] <TASK>
[ 1435.600605] ? show_regs+0x65/0x70
[ 1435.602396] ? __die+0x24/0x70
[ 1435.603999] ? page_fault_oops+0x99/0x1a0
[ 1435.605856] ? do_user_addr_fault+0x2ae/0x670
[ 1435.607915] ? exc_page_fault+0x7b/0x170
[ 1435.609976] ? asm_exc_page_fault+0x27/0x30
[ 1435.611989] ? pick_next_task_fair+0x91/0x620
[ 1435.614311] ? pick_next_task_fair+0x91/0x620
[ 1435.616811] ? wp_page_copy+0x2f7/0x690
[ 1435.618799] pick_next_task+0x5f/0xcd0
[ 1435.621060] ? do_wp_page+0x1d0/0x430
[ 1435.623596] __schedule+0x169/0x760
[ 1435.625947] ? __cgroup_account_cputime+0x28/0x30
[ 1435.628329] ? update_curr+0x15e/0x1e0
[ 1435.630179] schedule+0x2c/0xf0
[ 1435.633476] do_sched_yield+0x85/0xb0
[ 1435.635452] __do_sys_sched_yield+0xe/0x20
[ 1435.637356] x64_sys_call+0x3d9/0x2030
[ 1435.639400] do_syscall_64+0x7b/0x160
[ 1435.641857] ? handle_mm_fault+0xac/0x3a0
[ 1435.644956] ? irqentry_exit_to_user_mode+0x7b/0x220
[ 1435.647799] ? irqentry_exit+0x1d/0x30
[ 1435.650587] ? exc_page_fault+0x87/0x170
[ 1435.653213] entry_SYSCALL_64_after_hwframe+0x78/0x80
[ 1435.656728] RIP: 0033:0x7fb92ab0e7db
[ 1435.659593] Code: 73 01 c3 48 8b 0d 3d 46 0f 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 18 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 0d 46 0f 00 f7 d8 64 89 01 48
[ 1435.675388] RSP: 002b:00007fff7ca243d8 EFLAGS: 00000282 ORIG_RAX: 0000000000000018
[ 1435.680830] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb92ab0e7db
[ 1435.686046] RDX: 000055c47ee77db0 RSI: 0000000000000000 RDI: 0000000000000002
[ 1435.690268] RBP: 0000000000000791 R08: 0000000000000002 R09: 011d99605fac8414
[ 1435.694941] R10: 00007fb92ad12fd0 R11: 0000000000000282 R12: 00007fb92acfde18
[ 1435.698607] R13: 0000000000000002 R14: 000000000001d9a5 R15: 0000000000000008
[ 1435.703633] </TASK>
[ 1435.705016] Modules linked in: vhost_vsock vmw_vsock_virtio_transport_common vsock vhost vhost_iotlb zfs(PO) spl(O) dccp_ipv4 dccp atm sm3_generic sm3_avx_x86_64 sm3 poly1305_generic poly1305_x86_64 nhpoly1305_avx2 nhpoly1305_sse2 nhpoly1305 libpoly1305 michael_mic md4 streebog_generic rmd160 cmac algif_rng twofish_generic twofish_avx_x86_64 twofish_x86_64_3way twofish_x86_64 twofish_common serpent_avx2 serpent_avx_x86_64 serpent_sse2_x86_64 serpent_generic fcrypt cast6_avx_x86_64 cast6_generic cast5_avx_x86_64 cast5_generic cast_common camellia_generic camellia_aesni_avx2 camellia_aesni_avx_x86_64 camellia_x86_64 blowfish_generic blowfish_x86_64 blowfish_common algif_skcipher algif_hash aria_aesni_avx2_x86_64 aria_aesni_avx_x86_64 aria_generic sm4_generic sm4_aesni_avx2_x86_64 sm4_aesni_avx_x86_64 sm4 ccm des3_ede_x86_64 des_generic libdes authenc aegis128 aegis128_aesni algif_aead af_alg tls 8021q garp mrp stp llc binfmt_misc nls_iso8859_1 xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_owner xt_tcpudp
[ 1435.705128] nft_compat nf_tables serio_raw joydev dm_multipath msr nvme_fabrics efi_pstore nfnetlink ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 hid_generic hid_hyperv crct10dif_pclmul crc32_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha256_ssse3 sha1_ssse3 hid pata_acpi hyperv_keyboard hyperv_drm hv_netvsc aesni_intel crypto_simd cryptd
[ 1435.776455] CR2: 00000000000000a0
[ 1435.778976] ---[ end trace 0000000000000000 ]---
[ 1435.782217] RIP: 0010:pick_next_task_fair+0x91/0x620
[ 1435.785040] Code: 91 00 00 00 49 81 bd b0 02 00 00 80 a8 89 92 75 60 4d 89 fe eb 27 4c 89 f7 e8 0b b7 ff ff 84 c0 75 3f 4c 89 f7 e8 5f 04 ff ff <4c> 8b b0 a0 00 00 00 48 89 c3 4d 85 f6 0f 84 f4 00 00 00 49 8b 46
[ 1435.794724] RSP: 0018:ffffb2b202e73cf8 EFLAGS: 00010096
[ 1435.798116] RAX: 0000000000000000 RBX: ffffb2b202e73dc8 RCX: fd78d84d198c4000
[ 1435.802543] RDX: 0000000000000c00 RSI: e411d03fda1d7382 RDI: 0000000000000c02
[ 1435.807466] RBP: ffffb2b202e73d38 R08: 0000000000000002 R09: 0000000000000002
[ 1435.811823] R10: 0000000000000000 R11: 0000000000000000 R12: ffff920dbbc33580
[ 1435.815818] R13: ffff920d05570000 R14: ffff920dbbc33680 R15: ffff920dbbc33680
[ 1435.820778] FS: 00007fb92ad12d00(0000) GS:ffff920dbbc00000(0000) knlGS:0000000000000000
[ 1435.825269] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1435.828468] CR2: 00000000000000a0 CR3: 0000000102364001 CR4: 00000000003706f0
[ 1435.832087] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1435.837461] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1435.841312] note: stress-ng-race-[121253] exited with irqs disabled
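
Reading the oops: the bytes marked <4c> on the Code: line begin the faulting instruction, RAX is 0, and the fault address (CR2) is 0xa0. The sketch below checks that these three facts agree. It pulls the fields out of a few lines copied from the console output above and decodes just the one instruction form seen here (REX.W 8B /r with a 32-bit displacement). The helper names parse_oops and decode_mov_disp32 are made up for illustration, and this is not a general x86 decoder.

```python
import re

# Lines copied verbatim from the console output in this report.
OOPS = """\
[ 1435.517597] BUG: kernel NULL pointer dereference, address: 00000000000000a0
[ 1435.543274] RIP: 0010:pick_next_task_fair+0x91/0x620
[ 1435.545480] Code: 91 00 00 00 49 81 bd b0 02 00 00 80 a8 89 92 75 60 4d 89 fe eb 27 4c 89 f7 e8 0b b7 ff ff 84 c0 75 3f 4c 89 f7 e8 5f 04 ff ff <4c> 8b b0 a0 00 00 00 48 89 c3 4d 85 f6 0f 84 f4 00 00 00 49 8b 46
[ 1435.558030] RAX: 0000000000000000 RBX: ffffb2b202e73dc8 RCX: fd78d84d198c4000
"""

REGS = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
        "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"]


def parse_oops(text):
    """Return (fault address, RIP symbol, RAX value, first 7 faulting bytes)."""
    addr = int(re.search(r"address: ([0-9a-f]+)", text).group(1), 16)
    symbol = re.search(r"RIP: 0010:(\S+)", text).group(1)
    rax = int(re.search(r"RAX: ([0-9a-f]+)", text).group(1), 16)
    code = re.search(r"Code: (.*)", text).group(1).split()
    # The kernel wraps the first byte of the faulting instruction in <>.
    start = next(i for i, b in enumerate(code) if b.startswith("<"))
    insn = [int(b.strip("<>"), 16) for b in code[start:start + 7]]
    return addr, symbol, rax, insn


def decode_mov_disp32(insn):
    """Decode only 'mov r64, [base + disp32]' (REX.W 8B /r, mod=10, no SIB)."""
    rex, opcode, modrm = insn[0], insn[1], insn[2]
    assert rex & 0xF8 == 0x48 and opcode == 0x8B, "not a 64-bit MOV r64, r/m64"
    assert modrm >> 6 == 0b10 and modrm & 7 != 4, "expected disp32 without SIB"
    dest = REGS[((modrm >> 3) & 7) | ((rex & 4) << 1)]  # REX.R extends reg
    base = REGS[(modrm & 7) | ((rex & 1) << 3)]         # REX.B extends r/m
    disp = int.from_bytes(bytes(insn[3:7]), "little", signed=True)
    return dest, base, disp


addr, symbol, rax, insn = parse_oops(OOPS)
dest, base, disp = decode_mov_disp32(insn)
print(f"{symbol}: mov {dest}, [{base} + {disp:#x}] with {base}={rax:#x}, fault at {addr:#x}")
```

The instruction decodes to a load of r14 from rax + 0xa0. With RAX equal to 0 this is a read of address 0xa0, which matches CR2. The bytes just before it (e8 5f 04 ff ff) are a near call, so a plausible reading is that a helper called from pick_next_task_fair returned NULL and the result was dereferenced. The log alone does not say which helper or which structure field is involved.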

I can reproduce this with 6.8.0-1001-azure + latest stress-ng (17bca4c329f8) as well.
To reproduce, clone stress-ng from https://github.com/ColinIanKing/stress-ng, build it with make, and run "./stress-ng -v -t 5 --race-sched 4 --race-sched-ops 3000 --ignite-cpu --syslog --verbose --verify --oomable" from the source tree.
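
For scripted reproduction, a minimal sketch that assembles the same stress-ng command line as an argument list. The function name race_sched_command and its parameter names are made up for illustration. The flags and default values are the ones quoted in this report. The sketch only builds and prints the command, because actually running it on an affected kernel is expected to panic the machine.

```python
import shlex


def race_sched_command(workers=4, ops=3000, timeout_s=5, binary="./stress-ng"):
    """Build the race-sched invocation quoted in this report as an argv list."""
    return [
        binary, "-v",
        "-t", str(timeout_s),
        "--race-sched", str(workers),
        "--race-sched-ops", str(ops),
        "--ignite-cpu", "--syslog", "--verbose", "--verify", "--oomable",
    ]


cmd = race_sched_command()
print(shlex.join(cmd))

# To actually run it (as root, from the stress-ng source tree, on a disposable VM):
#   import subprocess
#   subprocess.run(cmd, check=False)
```

Keeping the command as a list avoids shell quoting issues if the binary path or the values are changed later.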

Po-Hsu Lin (cypressyew)
description: updated
summary: - race_sched in ubuntu_stress_smoke_test will cause kernel panic on Azure
- 6.8
+ race_sched in ubuntu_stress_smoke_test will cause kernel panic on 6.8
+ with Azure Standard_A2_v2 instance
Po-Hsu Lin (cypressyew) wrote :

6.10.0-061000rc2-generic on Azure Standard_A2_v2 is OK.

description: updated
John Cabaj (john-cabaj) wrote :

noble:linux-azure 6.6.0-1001 works, so this was introduced somewhere in 6.8. Bisecting further...

John Cabaj (john-cabaj) wrote :

A mainline build of v6.7 works as well. Proceeding to bisect between v6.7 and v6.8.

John Cabaj (john-cabaj) wrote :

The problematic commit appears to be:

2227a957e1d5b1941be4e4207879ec74f4bb37f8: "sched/eevdf: Sort the rbtree by virtual deadline"

Looking at further options right now.
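
For readers unfamiliar with bisection, the search between a known-good and a known-bad kernel can be sketched as a binary search. This is an illustration only, not the script attached to this bug. The commit list is a toy (the placeholder names c1, c2, c4 are invented, and the real v6.7..v6.8 history is far larger and not linear), and in practice each is_bad check means building the kernel, booting it on the instance, and running the race-sched stressor.

```python
def first_bad(commits, is_bad):
    """Return the first bad commit.

    commits is ordered oldest to newest; commits[0] must be known good
    and commits[-1] known bad.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] good, commits[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]


# Toy history: only the endpoints and the culprit hash come from this report.
history = ["v6.7", "c1", "c2", "2227a957e1d5", "c4", "v6.8"]
culprit_index = history.index("2227a957e1d5")


def panics(commit):
    # Stand-in for "build, boot, run stress-ng --race-sched, watch for the oops".
    return history.index(commit) >= culprit_index


print(first_bad(history, panics))
```

The number of kernels to build and test grows with the logarithm of the number of commits in the range, which is what makes bisecting a full release cycle practical.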

John Cabaj (john-cabaj) wrote :

This issue was fixed by the following commit (upstream as of the 6.9 kernel):

1560d1f6eb6b398bddd80c16676776c0325fe5fe "sched/eevdf: Prevent vlag from going out of bounds in reweight_eevdf()"

I've sent the patches to the mailing list for noble:linux (https://lists.ubuntu.com/archives/kernel-team/2024-June/151360.html). I left out other derivatives as they'll get them from noble:linux. Oracular is tracking past the 6.9 kernel, so these patches should already be applied there.

I've also attached the bisect logs for the break and fix commits, as well as the script used to test (along with a patch to speed up testing).

John Cabaj (john-cabaj) wrote :
Changed in linux (Ubuntu Noble):
status: New → In Progress
assignee: nobody → John Cabaj (john-cabaj)
tags: added: patch
Stefan Bader (smb)
Changed in linux (Ubuntu Noble):
importance: Undecided → High
status: In Progress → Fix Committed
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux/6.8.0-38.38 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-noble-linux' to 'verification-done-noble-linux'. If the problem still exists, change the tag 'verification-needed-noble-linux' to 'verification-failed-noble-linux'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-noble-linux-v2 verification-needed-noble-linux
Po-Hsu Lin (cypressyew) wrote :

I no longer see this issue with 6.8.0-38.38.

tags: added: verification-done-noble-linux
removed: verification-needed-noble-linux
Changed in linux (Ubuntu):
status: New → Invalid
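
To check whether a given machine is likely exposed, the version information in this report can be turned into a small test on the `uname -r` string. This is a rough sketch under stated assumptions: kernels outside the 6.8 series are treated as unaffected (v6.7 works per the bisect, and the fix is upstream as of 6.9), and noble generic kernels are treated as fixed from ABI 38 onward (6.8.0-38.38 was verified above). Derivative flavours such as azure use their own ABI numbering and their fixed versions are not stated here, so the function returns None for them. The name is_affected is made up for illustration.

```python
import re

FIXED_GENERIC_ABI = 38  # linux/6.8.0-38.38, verified in this report


def is_affected(release):
    """Classify a `uname -r` string: True, False, or None when the report cannot say."""
    m = re.match(r"(\d+)\.(\d+)\.(\d+)(?:-(\d+)-(\w+))?", release)
    if not m:
        return None
    major, minor = int(m.group(1)), int(m.group(2))
    if (major, minor) != (6, 8):
        # Assumption: <= 6.7 lacks the breaking commit, >= 6.9 carries the fix.
        return False
    abi, flavour = m.group(4), m.group(5)
    if flavour == "generic" and abi is not None:
        return int(abi) < FIXED_GENERIC_ABI
    # Mainline 6.8 builds and derivative flavours: not covered by this report.
    return None


for rel in ["6.8.0-35-generic", "6.8.0-38-generic", "6.8.0-1008-azure"]:
    print(rel, is_affected(rel))
```

On a live system the input would come from `platform.release()`. A None result means the package changelog for that flavour needs to be checked for the reweight_eevdf() fix.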