guest kernel panic in live migration test

Bug #1950310 reported by Balazs Gibizer
Affects: OpenStack Compute (nova) | Status: Confirmed | Importance: High | Assigned to: Unassigned
Affects: OpenStack Compute (nova) / Victoria | Status: New | Importance: Undecided | Assigned to: Unassigned

Bug Description

There are various kernel panics visible in the guest console in the nova-live-migration job. So far they are mostly seen on stable/victoria.

Example run: https://zuul.opendev.org/t/openstack/build/67c89daf17e3475cb1d632f87beeb60d/logs

Nova stack trace:

Nov 08 16:43:33.983139 ubuntu-focal-rax-iad-0027262323 nova-compute[52781]: ERROR oslo.service.loopingcall [-] Dynamic interval looping call 'oslo_service.loopingcall.RetryDecorator.__call__.<locals>._func' failed: nova.exception.DeviceDetachFailed: Device detach failed for vdb: Unable to detach the device from the live config.

Nov 08 16:43:33.983139 ubuntu-focal-rax-iad-0027262323 nova-compute[52781]: ERROR oslo.service.loopingcall Traceback (most recent call last):

Nov 08 16:43:33.983139 ubuntu-focal-rax-iad-0027262323 nova-compute[52781]: ERROR oslo.service.loopingcall File "/usr/local/lib/python3.8/dist-packages/oslo_service/loopingcall.py", line 150, in _run_loop

Nov 08 16:43:33.983139 ubuntu-focal-rax-iad-0027262323 nova-compute[52781]: ERROR oslo.service.loopingcall result = func(*self.args, **self.kw)

Nov 08 16:43:33.983139 ubuntu-focal-rax-iad-0027262323 nova-compute[52781]: ERROR oslo.service.loopingcall File "/usr/local/lib/python3.8/dist-packages/oslo_service/loopingcall.py", line 428, in _func

Nov 08 16:43:33.983139 ubuntu-focal-rax-iad-0027262323 nova-compute[52781]: ERROR oslo.service.loopingcall return self._sleep_time

Nov 08 16:43:33.983139 ubuntu-focal-rax-iad-0027262323 nova-compute[52781]: ERROR oslo.service.loopingcall File "/usr/local/lib/python3.8/dist-packages/oslo_utils/excutils.py", line 220, in __exit__

Nov 08 16:43:33.983139 ubuntu-focal-rax-iad-0027262323 nova-compute[52781]: ERROR oslo.service.loopingcall self.force_reraise()

Nov 08 16:43:33.983139 ubuntu-focal-rax-iad-0027262323 nova-compute[52781]: ERROR oslo.service.loopingcall File "/usr/local/lib/python3.8/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise

Nov 08 16:43:33.983139 ubuntu-focal-rax-iad-0027262323 nova-compute[52781]: ERROR oslo.service.loopingcall six.reraise(self.type_, self.value, self.tb)

Nov 08 16:43:33.983139 ubuntu-focal-rax-iad-0027262323 nova-compute[52781]: ERROR oslo.service.loopingcall File "/usr/local/lib/python3.8/dist-packages/six.py", line 703, in reraise

Nov 08 16:43:33.983139 ubuntu-focal-rax-iad-0027262323 nova-compute[52781]: ERROR oslo.service.loopingcall raise value

Nov 08 16:43:33.983139 ubuntu-focal-rax-iad-0027262323 nova-compute[52781]: ERROR oslo.service.loopingcall File "/usr/local/lib/python3.8/dist-packages/oslo_service/loopingcall.py", line 407, in _func

Nov 08 16:43:33.983139 ubuntu-focal-rax-iad-0027262323 nova-compute[52781]: ERROR oslo.service.loopingcall result = f(*args, **kwargs)

Nov 08 16:43:33.983139 ubuntu-focal-rax-iad-0027262323 nova-compute[52781]: ERROR oslo.service.loopingcall File "/opt/stack/nova/nova/virt/libvirt/guest.py", line 483, in _do_wait_and_retry_detach

Nov 08 16:43:33.983139 ubuntu-focal-rax-iad-0027262323 nova-compute[52781]: ERROR oslo.service.loopingcall raise exception.DeviceDetachFailed(

Nov 08 16:43:33.983139 ubuntu-focal-rax-iad-0027262323 nova-compute[52781]: ERROR oslo.service.loopingcall nova.exception.DeviceDetachFailed: Device detach failed for vdb: Unable to detach the device from the live config.

Nov 08 16:43:33.983139 ubuntu-focal-rax-iad-0027262323 nova-compute[52781]: ERROR oslo.service.loopingcall

Nov 08 16:43:33.983139 ubuntu-focal-rax-iad-0027262323 nova-compute[52781]: WARNING nova.virt.block_device [None req-9e67da96-3f0f-4e77-8bc3-0682b443b86c tempest-LiveAutoBlockMigrationV225Test-2108607333 tempest-LiveAutoBlockMigrationV225Test-2108607333-project] [instance: 06c70e38-8f0e-43a9-b783-cda99fa6b5e0] Guest refused to detach volume 8806a6e3-b560-4f72-8f1b-8cef2c257fa1: nova.exception.DeviceDetachFailed: Device detach failed for vdb: Unable to detach the device from the live config.

https://zuul.opendev.org/t/openstack/build/67c89daf17e3475cb1d632f87beeb60d/log/compute1/logs/screen-n-cpu.txt#10396-10491
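For context, the traceback above comes out of the detach-retry machinery: nova wraps its "wait for the device to disappear" helper in oslo_service's RetryDecorator, which re-runs the helper on a dynamic-interval looping call and re-raises once the retries are exhausted. Below is a minimal, self-contained sketch of that pattern only; it is not nova's actual guest.py code, and the guest object, the DeviceBusy exception and the retry counts are placeholders for illustration.

from oslo_service import loopingcall


class DeviceDetachFailed(Exception):
    """Stand-in for nova.exception.DeviceDetachFailed."""


class DeviceBusy(Exception):
    """Hypothetical 'device still present, try again' signal."""


@loopingcall.RetryDecorator(max_retry_count=7, inc_sleep_time=2,
                            max_sleep_time=30, exceptions=(DeviceBusy,))
def _wait_for_detach(guest, alias):
    # RetryDecorator re-runs this on a DynamicLoopingCall while it raises
    # DeviceBusy; after max_retry_count it re-raises DeviceBusy to the caller.
    if guest.device_exists(alias):  # placeholder: still in the live config?
        raise DeviceBusy()


def detach_volume(guest, alias):
    guest.detach_device(alias)      # placeholder for the libvirt detach request
    try:
        _wait_for_detach(guest, alias)
    except DeviceBusy:
        # Retries exhausted: this is where the log above ends up with
        # "Device detach failed for vdb: Unable to detach ... live config."
        raise DeviceDetachFailed(
            "Device detach failed for %s: Unable to detach the device from "
            "the live config." % alias)

A live detach needs the guest to acknowledge the hot-unplug; a guest that has already panicked never does, so every retry still finds the device attached and the looping call fails exactly as in the log.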

Guest kernel panic:

2021-11-08 16:41:35,132 99547 DEBUG [tempest.common.waiters] Console output for 06c70e38-8f0e-43a9-b783-cda99fa6b5e0
body=
[ 15.293919] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
[ 15.298512] BUG: unable to handle page fault for address: ffff91bdc256c400
[ 15.299353] #PF: supervisor instruction fetch in kernel mode
[ 15.299943] #PF: error_code(0x0011) - permissions violation
[ 15.300902] PGD 5e01067 P4D 5e01067 PUD 5e02067 PMD 80000000024001e3
[ 15.302056] Oops: 0011 [#1] SMP NOPTI
[ 15.302770] CPU: 0 PID: 9 Comm: ksoftirqd/0 Not tainted 5.3.0-26-generic #28~18.04.1-Ubuntu
[ 15.303549] Hardware name: OpenStack Foundation OpenStack Nova, BIOS 1.13.0-1ubuntu1.1 04/01/2014
[ 15.305100] RIP: 0010:0xffff91bdc256c400
[ 15.305758] Code: 6f 72 69 74 79 20 66 6f 72 20 25 64 20 28 25 73 29 00 25 64 20 28 25 73 29 20 6f 6c 64 20 70 72 69 6f 72 69 74 79 20 25 64 2c <20> 6e 65 77 20 70 72 69 6f 72 69 74 79 20 25 64 0a 00 70 72 6f 63
[ 15.307322] RSP: 0018:ffffa59240053de8 EFLAGS: 00000292
[ 15.307835] RAX: ffff91bdc256c400 RBX: ffff91bdc762b4c0 RCX: ffff91bdc3239900
[ 15.308457] RDX: 0000000000000400 RSI: ffffa59240053df8 RDI: ffff91bdc67e6d88
[ 15.309109] RBP: ffffa59240053e48 R08: 0000000000000000 R09: 0000000000000001
[ 15.309725] R10: ffffa59240053bc0 R11: 000000000430d5a0 R12: ffff91bdc762b510
[ 15.310344] R13: ffffa59240053df8 R14: 000000000000000a R15: 0000000000000202
[ 15.311176] FS: 0000000000000000(0000) GS:ffff91bdc7600000(0000) knlGS:0000000000000000
[ 15.311892] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 15.312403] CR2: ffff91bdc256c400 CR3: 0000000002f6c000 CR4: 00000000000006f0
[ 15.313336] Call Trace:
[ 15.314947] ? rcu_core+0x2b5/0x450
[ 15.315635] rcu_core_si+0xe/0x10
[ 15.316138] __do_softirq+0xe4/0x2da
[ 15.316573] run_ksoftirqd+0x2b/0x40
[ 15.316952] smpboot_thread_fn+0xfc/0x170
[ 15.317326] kthread+0x121/0x140
[ 15.317648] ? sort_range+0x30/0x30
[ 15.317980] ? kthread_park+0xb0/0xb0
[ 15.318337] ret_from_fork+0x35/0x40
[ 15.318819] Modules linked in: ip_tables x_tables nls_utf8 nls_iso8859_1 nls_ascii isofs hid_generic usbhid hid virtio_rng virtio_gpu drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm virtio_scsi virtio_net net_failover failover virtio_input virtio_blk qemu_fw_cfg 9pnet_virtio 9pnet pcnet32 8139cp mii ne2k_pci 8390 e1000
[ 15.322811] CR2: ffff91bdc256c400
[ 15.324234] ---[ end trace 73d738baa971ca73 ]---
[ 15.324797] RIP: 0010:0xffff91bdc256c400
[ 15.325176] Code: 6f 72 69 74 79 20 66 6f 72 20 25 64 20 28 25 73 29 00 25 64 20 28 25 73 29 20 6f 6c 64 20 70 72 69 6f 72 69 74 79 20 25 64 2c <20> 6e 65 77 20 70 72 69 6f 72 69 74 79 20 25 64 0a 00 70 72 6f 63
[ 15.326679] RSP: 0018:ffffa59240053de8 EFLAGS: 00000292
[ 15.327142] RAX: ffff91bdc256c400 RBX: ffff91bdc762b4c0 RCX: ffff91bdc3239900
[ 15.327742] RDX: 0000000000000400 RSI: ffffa59240053df8 RDI: ffff91bdc67e6d88
[ 15.328342] RBP: ffffa59240053e48 R08: 0000000000000000 R09: 0000000000000001
[ 15.328964] R10: ffffa59240053bc0 R11: 000000000430d5a0 R12: ffff91bdc762b510
[ 15.329563] R13: ffffa59240053df8 R14: 000000000000000a R15: 0000000000000202
[ 15.330167] FS: 0000000000000000(0000) GS:ffff91bdc7600000(0000) knlGS:0000000000000000
[ 15.330854] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 15.331348] CR2: ffff91bdc256c400 CR3: 0000000002f6c000 CR4: 00000000000006f0
[ 15.332107] Kernel panic - not syncing: Fatal exception in interrupt
[ 15.333470] Kernel Offset: 0x34800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 15.334628] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---

Additional hits: https://paste.opendev.org/show/810868/
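When triaging similar gate failures, the panic fingerprint in the console output that tempest.common.waiters logs (as above) is easy to scan for. The following is a hypothetical helper for that, not part of tempest or nova:

import re
import sys

# Markers the console dump above contains when the guest kernel has crashed.
PANIC_MARKERS = (
    re.compile(r"Kernel panic - not syncing"),
    re.compile(r"BUG: unable to handle page fault"),
    re.compile(r"kernel tried to execute NX-protected page"),
)


def find_panics(console_text):
    """Return the console lines that match a known panic marker."""
    return [line for line in console_text.splitlines()
            if any(marker.search(line) for marker in PANIC_MARKERS)]


if __name__ == "__main__":
    # Usage: python3 find_panics.py < tempest.log
    hits = find_panics(sys.stdin.read())
    for line in hits:
        print(line)
    sys.exit(1 if hits else 0)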

Tags: gate-failure
Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

We noticed that wallaby, xena and master use cirros 0.5.2 while victoria uses cirros 0.5.1. However, we ruled that out as the source of the problem by trying 0.5.2 on victoria and seeing the same panics: https://review.opendev.org/c/openstack/nova/+/817173

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

Another difference between wallaby (and newer) and victoria is that the patch series changing the libvirt device detach flow is only backported down to wallaby. A victoria backport is now proposed to see whether it helps: https://review.opendev.org/q/topic:bug/1882521

Changed in nova:
status: New → Invalid
Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

I'm moving this back to New on master to keep it listed in our bug query.

Changed in nova:
status: Invalid → New
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/817562

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/victoria)

Related fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/nova/+/817564

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by "Balazs Gibizer <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/nova/+/817562
Reason: wrong target branch

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

Summary from today's troubleshooting (for the longer form, read the IRC log [1]):
* The kernel panic happens during, or right after, the live migration, so by the time the volume detach is attempted the guest is already hanging (see the libvirt-level sketch below).
* The oldest log we have is from 29 October and it is green, and it has the same qemu and libvirt versions (qemu-system-x86 amd64 1:4.2-3ubuntu6.18 and libvirt0 amd64 6.0.0-0ubuntu8.14) as the recent runs producing the panic.
* The October run uses cirros 0.5.1, as do the recent failing runs. Master uses 0.5.2, but we were able to reproduce the bug with 0.5.2 on stable/victoria, so the guest OS image is not the cause.

[1]https://meetings.opendev.org/irclogs/%23openstack-nova/%23openstack-nova.2021-11-12.log.html#t2021-11-12T13:17:46
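The first point matters because a live detach needs guest cooperation: libvirt only sends the unplug request to the running domain, and the device disappears from the live XML once the guest OS acknowledges it. A rough sketch of that interaction with the libvirt Python bindings follows; the domain name, disk XML and crude string check are made up purely for illustration.

import time

import libvirt

DOMAIN_NAME = "instance-00000001"  # hypothetical domain, for illustration only
DISK_XML = """
<disk type='block' device='disk'>
  <target dev='vdb' bus='virtio'/>
</disk>
"""

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName(DOMAIN_NAME)

# This only *requests* the hot-unplug from the running guest.
dom.detachDeviceFlags(DISK_XML, libvirt.VIR_DOMAIN_AFFECT_LIVE)

# The <target dev='vdb'> entry only leaves the live XML after the guest OS
# acknowledges the unplug. A panicked guest never responds, so a wait loop
# like this times out, which is what nova's retry loop runs into here.
for _ in range(10):
    if "vdb" not in dom.XMLDesc(0):
        print("detached")
        break
    time.sleep(2)
else:
    print("device still attached after waiting; guest did not respond")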

Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

Setting this to Confirmed since we have already triaged it.

Changed in nova:
status: New → Confirmed
importance: Undecided → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/victoria)

Change abandoned by "Stephen Finucane <email address hidden>" on branch: stable/victoria
Review: https://review.opendev.org/c/openstack/nova/+/817564
