ubuntu_nbd_smoke test failed on f1.micro in google cloud with 5.8 / 5.11 (kernel NULL pointer dereference)

Bug #1925465 reported by Po-Hsu Lin
This bug affects 1 person
Affects              Status     Importance  Assigned to  Milestone
ubuntu-kernel-tests  Won't Fix  Undecided   Unassigned

Bug Description

Issue found on 5.8.0-51.57-generic, and only on the f1-micro instance among all of our Google instances.

Test failed with:
 Running '/home/jenkins/autotest/client/tests/ubuntu_nbd_smoke_test/ubuntu_nbd_smoke_test.sh'
 creating backing nbd image /tmp/nbd_image.img

 --------------------------------------------------------------------------------
 Image path: /tmp/nbd_image.img
 Mount point: /mnt/nbd-test-7613
 Date: Tue Apr 20 15:43:54 UTC 2021
 Host: g-l-generic-5-8-0-f1-micro-nbd-smoke-test
 Kernel: 5.8.0-51-generic #57-Ubuntu SMP Wed Apr 14 16:02:45 UTC 2021
 Machine: g-l-generic-5-8-0-f1-micro-nbd-smoke-test x86_64 x86_64
 CPUs online: 1
 CPUs total: 1
 Page size: 4096
 Pages avail: 1336
 Pages total: 146583
 Free space:
 Filesystem Size Used Avail Use% Mounted on
 tmpfs 58M 996K 57M 2% /run
 /dev/sda1 9.6G 3.4G 6.2G 35% /
 tmpfs 287M 0 287M 0% /dev/shm
 tmpfs 5.0M 0 5.0M 0% /run/lock
 tmpfs 4.0M 0 4.0M 0% /sys/fs/cgroup
 /dev/sda15 105M 7.9M 97M 8% /boot/efi
 tmpfs 58M 4.0K 58M 1% /run/user/1007
 --------------------------------------------------------------------------------

 NBD device /dev/nbd0 created
 found nbd export
 NBD exports found:
 test
 starting client with NBD device /dev/nbd0
 Negotiation: ..size = 128MB
 Error: Failed to setup device, check dmesg

 nbd-client failed to start
 unmounting /mnt/nbd-test-7613
 Exiting.
 umount: /mnt/nbd-test-7613: no mount point specified.
 stopping client
 /home/jenkins/autotest/client/tests/ubuntu_nbd_smoke_test/ubuntu_nbd_smoke_test.sh: line 37: 7782 Killed nbd-client -d ${NBD_DEV}
 Found kernel warning, IO error and/or call trace
 echo
 [ 155.925988] creating backing nbd image /tmp/nbd_image.img
 [ 159.733368] NBD device /dev/nbd0 created
 [ 161.978549] found nbd export
 [ 163.003727] starting client with NBD device /dev/nbd0
 [ 163.464838] nbd: nbd0 already in use
 [ 163.473599] nbd-client failed to start
 [ 163.473665] unmounting /mnt/nbd-test-7613
 [ 164.712115] stopping client
 [ 164.715177] block nbd0: NBD_DISCONNECT
 [ 164.716596] BUG: kernel NULL pointer dereference, address: 0000000000000020
 [ 164.723829] #PF: supervisor write access in kernel mode
 [ 164.729171] #PF: error_code(0x0002) - not-present page
 [ 164.734418] PGD 0 P4D 0
 [ 164.737065] Oops: 0002 [#1] SMP PTI
 [ 164.740665] CPU: 0 PID: 7782 Comm: nbd-client Not tainted 5.8.0-51-generic #57-Ubuntu
 [ 164.748604] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
 [ 164.957574] RIP: 0010:mutex_lock+0x1e/0x40
 [ 164.961800] Code: c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 54 49 89 fc e8 9d e4 ff ff 31 c0 65 48 8b 14 25 c0 7b 01 00 <3e> 49 0f b1 14 24 75 06 4c 8b 65 f8 c9 c3 4c 89 e7 e8 ac ff ff ff
 [ 164.980863] RSP: 0000:ffffbce700877998 EFLAGS: 00010246
 [ 164.986196] RAX: 0000000000000000 RBX: ffffffffb41fdd00 RCX: 0000000000000000
 [ 164.994549] RDX: ffff9f7700f6c680 RSI: ffffffffb39a8103 RDI: 0000000000000020
 [ 165.002012] RBP: ffffbce7008779a0 R08: 0000000000000000 R09: ffff9f7794c64000
 [ 165.208121] R10: ffffffffb4069a00 R11: 0000000000000005 R12: 0000000000000020
 [ 165.215568] R13: ffffbce7008779c0 R14: 0000000000000068 R15: ffffffffc078ae88
 [ 165.222840] FS: 00007f02e2ac0f80(0000) GS:ffff9f7724200000(0000) knlGS:0000000000000000
 [ 165.231062] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 [ 165.237027] CR2: 0000000000000020 CR3: 0000000014c64004 CR4: 00000000003606f0
 [ 165.244562] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 [ 165.251829] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 [ 165.457633] Call Trace:
 [ 165.460272] flush_workqueue+0x7c/0x400
 [ 165.464233] nbd_disconnect_and_put+0x58/0x80 [nbd]
 [ 165.469256] nbd_genl_disconnect+0xdd/0x1c0 [nbd]
 [ 165.474180] genl_family_rcv_msg+0x17b/0x290
 [ 165.479051] genl_rcv_msg+0x4c/0xa0
 [ 165.482740] ? genl_family_rcv_msg+0x290/0x290
 [ 165.487415] netlink_rcv_skb+0x4e/0x110
 [ 165.491362] genl_rcv+0x29/0x40
 [ 165.494643] netlink_unicast+0x218/0x330
 [ 165.498832] netlink_sendmsg+0x221/0x440
 [ 165.502868] sock_sendmsg+0x65/0x70
 [ 165.506483] ____sys_sendmsg+0x257/0x2a0
 [ 165.709230] ? sendmsg_copy_msghdr+0x7e/0xa0
 [ 165.713616] ___sys_sendmsg+0x82/0xc0
 [ 165.717391] ? lru_cache_add_active_or_unevictable+0x3a/0xb0
 [ 165.723160] ? do_anonymous_page+0x253/0x460
 [ 165.727553] ? handle_pte_fault+0x22b/0x260
 [ 165.731931] ? __handle_mm_fault+0x610/0x730
 [ 165.736501] __sys_sendmsg+0x62/0xb0
 [ 165.740185] __x64_sys_sendmsg+0x1f/0x30
 [ 165.744240] do_syscall_64+0x49/0xc0
 [ 165.747957] entry_SYSCALL_64_after_hwframe+0x44/0xa9
 [ 165.753138] RIP: 0033:0x7f02e3012777
 [ 165.756839] Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
 [ 165.974269] RSP: 002b:00007fff23050008 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
 [ 165.981963] RAX: ffffffffffffffda RBX: 00005571d0cb7300 RCX: 00007f02e3012777
 [ 165.989331] RDX: 0000000000000000 RSI: 00007fff23050040 RDI: 0000000000000008
 [ 165.996573] RBP: 00005571d0cb7420 R08: 0000000000000014 R09: 00005571d0cb8690
 [ 166.003826] R10: 00007f02e30dc210 R11: 0000000000000246 R12: 00005571d0cb7210
 [ 166.209742] R13: 00007fff23050040 R14: 0000000000000001 R15: 00007fff230503b0
 [ 166.217009] Modules linked in: nbd nls_iso8859_1 dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua intel_rapl_msr intel_rapl_common sb_edac rapl input_leds serio_raw pvpanic efi_pstore mac_hid sch_fq_codel drm virtio_rng ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper psmouse virtio_net net_failover failover virtio_scsi i2c_piix4
 [ 166.465182] CR2: 0000000000000020
 [ 166.468656] ---[ end trace 469eaeb4bef09dfe ]---
 [ 166.497719] RIP: 0010:mutex_lock+0x1e/0x40
 [ 166.502034] Code: c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 54 49 89 fc e8 9d e4 ff ff 31 c0 65 48 8b 14 25 c0 7b 01 00 <3e> 49 0f b1 14 24 75 06 4c 8b 65 f8 c9 c3 4c 89 e7 e8 ac ff ff ff
 [ 166.720537] RSP: 0000:ffffbce700877998 EFLAGS: 00010246
 [ 166.725902] RAX: 0000000000000000 RBX: ffffffffb41fdd00 RCX: 0000000000000000
 [ 166.733155] RDX: ffff9f7700f6c680 RSI: ffffffffb39a8103 RDI: 0000000000000020
 [ 166.740531] RBP: ffffbce7008779a0 R08: 0000000000000000 R09: ffff9f7794c64000
 [ 166.747889] R10: ffffffffb4069a00 R11: 0000000000000005 R12: 0000000000000020
 [ 166.755142] R13: ffffbce7008779c0 R14: 0000000000000068 R15: ffffffffc078ae88
 [ 166.961921] FS: 00007f02e2ac0f80(0000) GS:ffff9f7724200000(0000) knlGS:0000000000000000
 [ 166.970133] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 [ 166.976030] CR2: 0000000000000020 CR3: 0000000014c64004 CR4: 00000000003606f0
 [ 166.983291] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 [ 166.990539] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 [ 168.985943] Found kernel warning, IO error and/or call trace
 [ 169.001889] echo
 killing server
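
For readers triaging the trace above, here is a minimal sketch of pulling the key facts out of such a dmesg excerpt: the faulting address, the kernel-mode RIP symbol, and the confirmed call-trace frames (entries prefixed with "?" are skipped as unreliable). The function name `summarize_oops` and the regular expressions are illustrative assumptions; this is not the check that ubuntu_nbd_smoke_test.sh itself performs. It assumes x86-64 output, where the kernel code segment shows up as `RIP: 0010:`.

```python
import re

# Leading "[ 164.716596] " style timestamp as seen in the log above.
TS = re.compile(r"^\s*\[\s*\d+\.\d+\]\s*")
# A call-trace entry such as "flush_workqueue+0x7c/0x400", optionally "? "-prefixed.
FRAME = re.compile(r"^(\? )?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+")


def summarize_oops(lines):
    """Return the fault address, faulting symbol and reliable frames of an oops."""
    summary = {"address": None, "rip": None, "trace": []}
    in_trace = False
    for raw in lines:
        line = TS.sub("", raw)
        if line.startswith("BUG: kernel NULL pointer dereference, address:"):
            summary["address"] = int(line.rsplit(":", 1)[1].strip(), 16)
        elif line.startswith("RIP: 0010:") and summary["rip"] is None:
            # "RIP: 0010:mutex_lock+0x1e/0x40" -> "mutex_lock"
            summary["rip"] = line.split(":", 2)[2].split("+")[0]
        elif line.startswith("Call Trace:"):
            in_trace = True
        elif in_trace:
            match = FRAME.match(line)
            if match is None:
                in_trace = False          # first non-frame line ends the trace
            elif match.group(1) is None:
                summary["trace"].append(match.group(2))
    return summary


if __name__ == "__main__":
    sample = [
        "[ 164.716596] BUG: kernel NULL pointer dereference, address: 0000000000000020",
        "[ 164.957574] RIP: 0010:mutex_lock+0x1e/0x40",
        "[ 165.457633] Call Trace:",
        "[ 165.460272] flush_workqueue+0x7c/0x400",
        "[ 165.464233] nbd_disconnect_and_put+0x58/0x80 [nbd]",
    ]
    print(summarize_oops(sample))
```

On the trace in this report, the fault address (0x20) equals the RDI value passed to mutex_lock, and the immediate caller is flush_workqueue reached from nbd_disconnect_and_put. That would be consistent with a NULL workqueue pointer being flushed, although nothing in this thread confirms it.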

Revision history for this message
Po-Hsu Lin (cypressyew) wrote :

This issue does not occur on this instance with 5.8.0-49. However, klebers found that in this cycle the instance got only ~512 MB of RAM, whereas it got ~3 GB in the last cycle; this may be why it is failing in this cycle.

tags: added: 5.8 groovy sru-20210412
tags: added: ubuntu-nbd-smoke-test
Revision history for this message
Colin Ian King (colin-king) wrote :

Does look like a low memory null ptr alloc failure to me.

Revision history for this message
Po-Hsu Lin (cypressyew) wrote :

Hi Colin, do you have a suggested minimum memory requirement that we can add to the test?
Thanks
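
A minimal sketch of what such a minimum-memory guard could look like, assuming a Linux host where `os.sysconf` exposes `SC_AVPHYS_PAGES` and `SC_PAGE_SIZE`. The 32 MiB threshold and the names `MIN_AVAIL_MIB`, `available_mib` and `should_skip` are hypothetical; no minimum figure is given anywhere in this report.

```python
import os

# Hypothetical threshold for illustration only; not taken from this bug report.
MIN_AVAIL_MIB = 32


def available_mib():
    """Currently available physical memory in MiB, from sysconf page counts."""
    pages = os.sysconf("SC_AVPHYS_PAGES")
    page_size = os.sysconf("SC_PAGE_SIZE")
    return pages * page_size / (1024 * 1024)


def should_skip(avail_mib, min_mib=MIN_AVAIL_MIB):
    """True when the test should be skipped for lack of memory."""
    return avail_mib < min_mib


if __name__ == "__main__":
    mib = available_mib()
    if should_skip(mib):
        print(f"SKIP: only {mib:.1f} MiB available, need {MIN_AVAIL_MIB} MiB")
    else:
        print(f"RUN: {mib:.1f} MiB available")
```

Keeping the decision in `should_skip` separate from the measurement makes the threshold easy to tune once a realistic minimum is known.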

Revision history for this message
Kleber Sacilotto de Souza (kleber-souza) wrote :

I was not able to reproduce this issue by manually running ubuntu_nbd_smoke test on an f1.micro instance, but it fails consistently when run via Jenkins. I'm wondering if something else is influencing this bug, for example a timing or memory consumption issue.

Revision history for this message
Colin Ian King (colin-king) wrote :

I suspect the overhead of running this via Jenkins is enough to trip the low memory issue. I'll fire up an instance and see if I can prune the test back to get this working more reliably on low memory systems.

Revision history for this message
Colin Ian King (colin-king) wrote :

But this is definitely a bug. I'm not sure it is a regression, though; perhaps it's an existing issue that is now being tickled by lower available memory, since kernels etc. grow in size over time.

Revision history for this message
Colin Ian King (colin-king) wrote :

The test should have a log that contains info on how many pages are available when running the test, something like:

10:08:25 DEBUG| [stdout] Image path: /tmp/nbd_image.img
10:08:25 DEBUG| [stdout] Mount point: /mnt/nbd-test-20140
10:08:25 DEBUG| [stdout] Date: Mon Apr 26 10:08:25 UTC 2021
10:08:25 DEBUG| [stdout] Host: selfprovisioned-f1microgroovycking
10:08:25 DEBUG| [stdout] Kernel: 5.8.0-1028-gcp #29-Ubuntu SMP Tue Apr 13 02:15:48 UTC 2021
10:08:25 DEBUG| [stdout] Machine: selfprovisioned-f1microgroovycking x86_64 x86_64
10:08:25 DEBUG| [stdout] CPUs online: 1
10:08:25 DEBUG| [stdout] CPUs total: 1
10:08:25 DEBUG| [stdout] Page size: 4096
10:08:25 DEBUG| [stdout] Pages avail: 18522
10:08:25 DEBUG| [stdout] Pages total: 146605
10:08:25 DEBUG| [stdout] Free space:
10:08:25 DEBUG| [stdout] Filesystem Size Used Avail Use% Mounted on
10:08:25 DEBUG| [stdout] /dev/root 9.6G 2.3G 7.3G 24% /
10:08:25 DEBUG| [stdout] tmpfs 287M 0 287M 0% /dev/shm
10:08:25 DEBUG| [stdout] tmpfs 115M 988K 114M 1% /run
10:08:25 DEBUG| [stdout] tmpfs 5.0M 0 5.0M 0% /run/lock
10:08:25 DEBUG| [stdout] tmpfs 4.0M 0 4.0M 0% /sys/fs/cgroup
10:08:25 DEBUG| [stdout] /dev/sda15 105M 7.9M 97M 8% /boot/efi
10:08:25 DEBUG| [stdout] tmpfs 58M 4.0K 58M 1% /run/user/1007

Can this be supplied to the bug report? It's really useful for figuring out the base memory available before the test is run.
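
To make the page counts easier to compare, here is a small sketch that converts them to MiB using the 4096-byte page size reported in the logs. The helper name `pages_to_mib` is an assumption for illustration; the input figures are the "Pages avail" and "Pages total" values quoted so far in this report.

```python
PAGE_SIZE = 4096  # "Page size" reported in the logs in this report


def pages_to_mib(pages, page_size=PAGE_SIZE):
    """Convert a page count to mebibytes."""
    return pages * page_size / (1024 * 1024)


# Figures copied from the logs above.
reported = {
    "failing Jenkins run, pages avail": 1336,
    "manual run (log above), pages avail": 18522,
    "failing Jenkins run, pages total": 146583,
}

for label, pages in reported.items():
    print(f"{label}: {pages} pages = {pages_to_mib(pages):.1f} MiB")
```

By this arithmetic the failing Jenkins run started with roughly 5 MiB available out of about 573 MiB total, while the manual run had roughly 72 MiB available.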

Revision history for this message
Colin Ian King (colin-king) wrote :

Workaround pushed to autotest-client-tests

https://kernel.ubuntu.com/git/ubuntu/autotest-client-tests.git/commit/?id=4d566a39e00295e5473e9b7994323b69e06499a2

Please re-test to see if this now works.

Revision history for this message
Po-Hsu Lin (cypressyew) wrote :

Re-tested on G-5.8 generic and gcp kernel with f1.micro

G-5.8-gcp passed, with avail pages:
 Pages avail: 9988

G-5.8-generic failed twice: the 8th run failed with "Error: Failed to setup device, check dmesg", but the 9th run failed with a kernel NULL pointer dereference.

G-5.8-generic 8th run:
Running '/home/jenkins/autotest/client/tests/ubuntu_nbd_smoke_test/ubuntu_nbd_smoke_test.sh'
 creating backing nbd image /tmp/nbd_image.img

 --------------------------------------------------------------------------------
 Image path: /tmp/nbd_image.img
 Mount point: /mnt/nbd-test-7659
 Date: Tue Apr 27 04:20:54 UTC 2021
 Host: g-l-generic-5-8-0-f1-micro-nbd-smoke-test
 Kernel: 5.8.0-51-generic #57-Ubuntu SMP Wed Apr 14 16:02:45 UTC 2021
 Machine: g-l-generic-5-8-0-f1-micro-nbd-smoke-test x86_64 x86_64
 CPUs online: 1
 CPUs total: 1
 Page size: 4096
 Pages avail: 1769
 Pages total: 146583
 Image size: 64 MB
 File size: 42 MB
 Free space:
 Filesystem Size Used Avail Use% Mounted on
 tmpfs 58M 996K 57M 2% /run
 /dev/sda1 9.6G 3.3G 6.3G 35% /
 tmpfs 287M 0 287M 0% /dev/shm
 tmpfs 5.0M 0 5.0M 0% /run/lock
 tmpfs 4.0M 0 4.0M 0% /sys/fs/cgroup
 /dev/sda15 105M 7.9M 97M 8% /boot/efi
 tmpfs 58M 4.0K 58M 1% /run/user/1007
 --------------------------------------------------------------------------------

 NBD device /dev/nbd0 created
 found nbd export
 NBD exports found:
 test
 starting client with NBD device /dev/nbd0
 Negotiation: ..size = 64MB
 Error: Failed to setup device, check dmesg

 nbd-client failed to start
 unmounting /mnt/nbd-test-7659
 Exiting.
 umount: /mnt/nbd-test-7659: no mount point specified.
 stopping client
 killing server

G-5.8-generic 9th run:
 Running '/home/jenkins/autotest/client/tests/ubuntu_nbd_smoke_test/ubuntu_nbd_smoke_test.sh'
 creating backing nbd image /tmp/nbd_image.img

 --------------------------------------------------------------------------------
 Image path: /tmp/nbd_image.img
 Mount point: /mnt/nbd-test-7616
 Date: Tue Apr 27 04:51:07 UTC 2021
 Host: g-l-generic-5-8-0-f1-micro-nbd-smoke-test
 Kernel: 5.8.0-51-generic #57-Ubuntu SMP Wed Apr 14 16:02:45 UTC 2021
 Machine: g-l-generic-5-8-0-f1-micro-nbd-smoke-test x86_64 x86_64
 CPUs online: 1
 CPUs total: 1
 Page size: 4096
 Pages avail: 2763
 Pages total: 146583
 Image size: 64 MB
 File size: 42 MB
 Free space:
 Filesystem Size Used Avail Use% Mounted on
 tmpfs 58M 1012K 57M 2% /run
 /dev/sda1 9.6G 3.3G 6.3G 35% /
 tmpfs 287M 0 287M 0% /dev/shm
 tmpfs 5.0M 0 5.0M 0% /run/lock
 tmpfs 4.0M 0 4.0M 0% /sys/fs/cgroup
 /dev/sda15 105M 7.9M 97M 8% /boot/efi
 tmpfs 58M 4.0K 58M 1% /run/user/1006
 --------------------------------------------------------------------------------

 NBD device /dev/nbd0 created
 found nbd export
 NBD exports found:
 test
 starting client with NBD device /dev/nbd0
 Negotiation: ..size = 64MB
 Error: Failed to ...


Revision history for this message
Po-Hsu Lin (cypressyew) wrote :

Test log for G-5.8-gcp

Revision history for this message
Po-Hsu Lin (cypressyew) wrote :

Found on Hirsute 5.11 generic / lowlatency kernel as well.
This test should be skipped on this instance.

tags: added: hirsute
tags: added: 5.11 sru-20210719
summary: - ubuntu_nbd_smoke test failed on f1.micro in google cloud with Groovy 5.8
+ ubuntu_nbd_smoke test failed on f1.micro in google cloud with 5.8 / 5.11
(kernel NULL pointer dereference)
Revision history for this message
Po-Hsu Lin (cypressyew) wrote :

Please find attached the test log (stdout) from f1-micro with Hirsute 5.11. I don't have the dmesg output; it can only be fetched by running the test again.

I am not sure it's still worth investigating this. As Sean said, we should probably just do a simple boot test on this small instance (f1-micro) on Google Cloud.

Let me know if you ever need the dmesg output with 5.11 Hirsute.

Revision history for this message
Po-Hsu Lin (cypressyew) wrote :

5.8 / 5.11 are gone, and we have stopped testing on this small instance. Marking this as Won't Fix.

Changed in ubuntu-kernel-tests:
status: New → Won't Fix