BUG: soft lockup - CPU#0 stuck for 22s! in Cirros 0.5.2 while detaching a volume

Bug #1931702 reported by Lee Yarwood
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Medium
Assigned to: Unassigned

Bug Description

Description
===========

test_live_block_migration_with_attached_volume fails during cleanup while detaching a volume from an instance that, as the test name suggests, has been migrated. We've not got the complete console for some reason, but the part we have shows the following soft lockup:

https://933286ee423f4ed9028e-1eceb8a6fb7f917522f65bda64a8589f.ssl.cf5.rackcdn.com/794766/2/check/nova-grenade-multinode/a5ff180/

[ 40.741525] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [run-parts:288]
[ 40.745566] Modules linked in: ahci libahci ip_tables x_tables nls_utf8 nls_iso8859_1 nls_ascii isofs hid_generic usbhid hid virtio_rng virtio_gpu drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm virtio_scsi virtio_net net_failover failover virtio_input virtio_blk qemu_fw_cfg 9pnet_virtio 9pnet pcnet32 8139cp mii ne2k_pci 8390 e1000
[ 40.750740] CPU: 0 PID: 288 Comm: run-parts Not tainted 5.3.0-26-generic #28~18.04.1-Ubuntu
[ 40.751458] Hardware name: OpenStack Foundation OpenStack Nova, BIOS 1.13.0-1ubuntu1.1 04/01/2014
[ 40.753365] RIP: 0010:__switch_to_asm+0x42/0x70
[ 40.754190] Code: 48 8b 9e c8 08 00 00 65 48 89 1c 25 28 00 00 00 49 c7 c4 10 00 00 00 e8 07 00 00 00 f3 90 0f ae e8 eb f9 e8 07 00 00 00 f3 90 <0f> ae e8 eb f9 49 ff cc 75 e3 48 81 c4 00 01 00 00 41 5f 41 5e 41
[ 40.755739] RSP: 0018:ffffb6a9c027bdb8 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff13
[ 40.756419] RAX: 0000000000000018 RBX: ffff97eec71e6000 RCX: 3c434e4753444bff
[ 40.757057] RDX: 0001020304050608 RSI: 8080808080808080 RDI: 0000000000000fe0
[ 40.757659] RBP: ffffb6a9c027bde8 R08: fefefefefefefeff R09: 0000000000000000
[ 40.758268] R10: 0000000000000fc8 R11: 0000000040042000 R12: 00007ffd9666df63
[ 40.758954] R13: 0000000000000000 R14: 0000000000000001 R15: 00000000000007ff
[ 40.759654] FS: 00007f55b7e936a0(0000) GS:ffff97eec7600000(0000) knlGS:0000000000000000
[ 40.760334] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 40.760830] CR2: 00000000006ad340 CR3: 0000000003cc8000 CR4: 00000000000006f0
[ 40.761685] Call Trace:
[ 40.762767] ? __switch_to_asm+0x34/0x70
[ 40.763183] ? __switch_to_asm+0x40/0x70
[ 40.763539] ? __switch_to_asm+0x34/0x70
[ 40.763895] ? __switch_to_asm+0x40/0x70
[ 40.764249] ? __switch_to_asm+0x34/0x70
[ 40.764597] ? __switch_to_asm+0x40/0x70
[ 40.764945] ? __switch_to_asm+0x34/0x70
[ 40.765311] __switch_to_asm+0x40/0x70
[ 40.765884] ? __switch_to_asm+0x34/0x70
[ 40.766239] ? __switch_to_asm+0x40/0x70
[ 40.766619] ? __switch_to_asm+0x34/0x70
[ 40.766972] ? __switch_to_asm+0x40/0x70
[ 40.767323] ? __switch_to_asm+0x34/0x70
[ 40.767677] ? __switch_to_asm+0x40/0x70
[ 40.768024] ? __switch_to_asm+0x34/0x70
[ 40.768375] ? __switch_to_asm+0x40/0x70
[ 40.768725] ? __switch_to_asm+0x34/0x70
[ 40.769516] ? __switch_to+0x112/0x480
[ 40.769864] ? __switch_to_asm+0x40/0x70
[ 40.770218] ? __switch_to_asm+0x34/0x70
[ 40.771035] ? __schedule+0x2b0/0x670
[ 40.771919] ? schedule+0x33/0xa0
[ 40.772741] ? prepare_exit_to_usermode+0x98/0xa0
[ 40.773398] ? retint_user+0x8/0x8

I'm going to see if I can instrument the test a little more to dump the console *after* the detach request so we get a better idea of what, if anything, went wrong in the guest OS.
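As a hedged sketch of that instrumentation (the helper name and patterns below are illustrative assumptions, not actual tempest code), scanning a dumped console for the guest kernel messages seen in this bug might look like:

```python
import re

# Illustrative patterns for the guest kernel messages seen in this bug:
# the watchdog soft lockup banner and a kernel oops marker.
LOCKUP_RE = re.compile(r"BUG: soft lockup - CPU#\d+ stuck for \d+s!")
OOPS_RE = re.compile(r"\bOops\b")

def find_guest_kernel_issues(console):
    """Return the console lines that indicate a soft lockup or kernel oops,
    so they can be attached to the test failure for triage."""
    return [
        line.strip()
        for line in console.splitlines()
        if LOCKUP_RE.search(line) or OOPS_RE.search(line)
    ]
```

Running this over the console fragment above would surface the watchdog line from `run-parts` while ignoring the ordinary boot output.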

Steps to reproduce
==================

nova-grenade-multinode and nova-live-migration have hit this thus far.

Expected result
===============

test_live_block_migration_with_attached_volume passes.

Actual result
=============

test_live_block_migration_with_attached_volume fails.

Environment
===========
1. Exact version of OpenStack you are running. See the following
  list for all releases: http://docs.openstack.org/releases/

   Master.

2. Which hypervisor did you use?
   (For example: Libvirt + KVM, Libvirt + XEN, Hyper-V, PowerKVM, ...)
   What's the version of that?

   libvirt + KVM

3. Which storage type did you use?
   (For example: Ceph, LVM, GPFS, ...)
   What's the version of that?

   N/A

4. Which networking type did you use?
   (For example: nova-network, Neutron with OpenVSwitch, ...)

   N/A

Logs & Configs
==============

See above.

Tags: gate-failure
Lee Yarwood (lyarwood)
summary: - BUG: soft lockup - CPU#0 stuck for 22s! while detaching a volume
+ test_live_block_migration_with_attached_volume fails with BUG: soft
+ lockup - CPU#0 stuck for 22s! in the guestOS while detaching a volume
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/795992

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/795997

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/795992
Committed: https://opendev.org/openstack/nova/commit/7c478ac0992d9713c0fd7ea0a0498f0b7f92ce0d
Submitter: "Zuul (22348)"
Branch: master

commit 7c478ac0992d9713c0fd7ea0a0498f0b7f92ce0d
Author: Lee Yarwood <email address hidden>
Date: Fri Jun 11 13:05:51 2021 +0100

    zuul: Skip block migration with attached volumes tests due to bug #1931702

    Bug #1931702 details soft lockups reported within the guest OS during
    live migration with block migration and a volume attached. These lockups
    then cause the request to detach the volume as part of the cleanup to
    fail. For the time being we should skip these tests until the underlying
    issue is resolved.

    Related-Bug: #1931702
    Change-Id: I7c1a647fb840fce98672a8429d554dd399cd13b7

Lee Yarwood (lyarwood)
summary: - test_live_block_migration_with_attached_volume fails with BUG: soft
- lockup - CPU#0 stuck for 22s! in the guestOS while detaching a volume
+ BUG: soft lockup - CPU#0 stuck for 22s! in Cirros 0.5.2 while detaching
+ a volume
Lee Yarwood (lyarwood)
Changed in nova:
status: New → Incomplete
importance: Undecided → Medium
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by "Stephen Finucane <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/nova/+/795997

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for OpenStack Compute (nova) because there has been no activity for 60 days.]

Changed in nova:
status: Incomplete → Expired
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/812473

Changed in nova:
status: Expired → In Progress
Revision history for this message
Lee Yarwood (lyarwood) wrote :

Still seeing guest OS panics during detach, as shown below:

https://53a68ba6d05d22d0f44d-c5982ef1d1780edfccf5471d12724c1b.ssl.cf5.rackcdn.com/812473/2/check/nova-grenade-multinode/31101fd/testr_results.html

2021-10-07 12:36:45,955 223284 INFO [tempest.lib.common.rest_client] Request (LiveMigrationTest:_run_cleanups): 200 POST http://10.210.192.135/compute/v2.1/servers/503a61f0-8900-4f2b-b7c4-5bd6402f575b/action 0.051s
2021-10-07 12:36:45,956 223284 DEBUG [tempest.lib.common.rest_client] Request - Headers: {'Content-Type': 'application/json', 'Accept': 'application/json', 'X-Auth-Token': '<omitted>'}
        Body: {"os-getConsoleOutput": {}}
   Response - Headers: {'date': 'Thu, 07 Oct 2021 12:36:45 GMT', 'server': 'Apache/2.4.41 (Ubuntu)', 'content-length': '9549', 'content-type': 'application/json', 'openstack-api-version': 'compute 2.1', 'x-openstack-nova-api-version': '2.1', 'vary': 'OpenStack-API-Version,X-OpenStack-Nova-API-Version', 'x-openstack-request-id': 'req-818114fc-1e1e-4e46-9748-de28184d4780', 'x-compute-request-id': 'req-818114fc-1e1e-4e46-9748-de28184d4780', 'connection': 'close', 'status': '200', 'content-location': 'http://10.210.192.135/compute/v2.1/servers/503a61f0-8900-4f2b-b7c4-5bd6402f575b/action'}
        Body: b'{"output": "[ 16.588694] BUG: unable to handle page fault for address: 0000000000461c36\\n[ 16.593560] #PF: supervisor read access in user mode\\n[ 16.593868] #PF: error_code(0x6e7860) - not-present page\\n[ 16.594289] IDT: 0xfffffe0000000000 (limit=0xfff) GDT: 0xfffffe0000001000 (limit=0x7f)\\n[ 16.594773] LDTR: NULL\\n[ 16.595779] TR: 0x40 -- base=0xfffffe0000003000 limit=0x206f\\n[ 16.596191] PGD 2da7067 P4D 2da7067 PUD 2da6067 PMD 3ce6067 PTE 0\\n[ 16.597308] Oops: 7860 [#1] SMP NOPTI\\n[ 16.597949] CPU: 0 PID: 289 Comm: wait_iface Not tainted 5.3.0-26-generic #28~18.04.1-Ubuntu\\n[ 16.598345] Hardware name: OpenStack Foundation OpenStack Nova, BIOS 1.13.0-1ubuntu1.1 04/01/2014\\n[ 16.599745] RIP: 0033:0x435117\\n[ 16.600260] Code: ff ff 48 89 df e8 c9 23 fd ff 5b e9 da fb ff ff 48 8b 15 0c 7e 27 00 53 8b 42 38 ff c0 89 42 38 bf 10 00 00 00 e8 b0 3a fd ff <48> 89 c3 48 8b 05 bf 7c 27 00 48 89 1d b8 7c 27 00 48 89 03 e8 a7\\n[ 16.601054] RSP: 002b:00007ffce3276f50 EFLAGS: 00000246\\n[ 16.601368] RAX: 00000000004822a0 RBX: 00000000006e4010 RCX: 00000000006e4010\\n[ 16.601710] RDX: 00007ffce3276eb0 RSI: 00000000006e77d0 RDI: 0000000000000007\\n[ 16.602049] RBP: 0000000000000000 R08: fefefefefefefeff R09: fefefefefefeff5a\\n[ 16.602380] R10: 0000000000000000 R11: 0000000000000246 R12: 00000000006e44b0\\n[ 16.602717] R13: 0000000000000000 R14: 00007ffce3276fa8 R15: 0000000000000000\\n[ 16.603127] FS: 00007fce1b5506a0 GS: 0000000000000000\\n[ 16.603445] Modules linked in: ahci libahci ip_tables x_tables nls_utf8 nls_iso8859_1 nls_ascii isofs hid_generic usbhid hid virtio_rng virtio_gpu drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm virtio_scsi virtio_net net_failover failover virtio_input virtio_blk qemu_fw_cfg 9pnet_virtio 9pnet pcnet32 8139cp mii ne2k_pci 8390 e1000\\n[ 16.605249] CR2: 0000000000461c36\\n[ 16.606644] ---[ end trace fafa623ec1329...
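For reference, the `os-getConsoleOutput` action in that request can be built as below. The server UUID comes from the tempest log above; the helper itself is just a sketch, though the optional `length` parameter is part of the compute API (omitting it, as tempest does here, requests the full console):

```python
import json

SERVER_ID = "503a61f0-8900-4f2b-b7c4-5bd6402f575b"  # from the tempest log above

def build_console_action(server_id, length=None):
    """Return the (path, JSON body) pair for the POST /servers/{id}/action
    request that fetches the guest console output."""
    action = {"os-getConsoleOutput": {}}
    if length is not None:
        # length caps the number of console lines returned by nova
        action["os-getConsoleOutput"]["length"] = length
    return f"/compute/v2.1/servers/{server_id}/action", json.dumps(action)
```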

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by "Lee Yarwood <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/nova/+/812473

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/812473
Committed: https://opendev.org/openstack/nova/commit/512aab83c87e716356af9f19b976808aa86eb220
Submitter: "Zuul (22348)"
Branch: master

commit 512aab83c87e716356af9f19b976808aa86eb220
Author: Lee Yarwood <email address hidden>
Date: Tue Oct 5 10:44:54 2021 +0100

    Revert "zuul: Skip block migration with attached volumes tests due to bug #1931702"

    This reverts commit 7c478ac0992d9713c0fd7ea0a0498f0b7f92ce0d.

    With the resolution of bug #1945983 in devstack we should also be able
    to start testing block migration with attached volumes once again.

    Closes-Bug: #1931702
    Depends-On: https://review.opendev.org/c/openstack/devstack/+/812391
    Depends-On: https://review.opendev.org/c/openstack/devstack/+/812925
    Change-Id: I1cb7a8f76c372d19227315361ecf5ff730ec6c36

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 26.0.0.0rc1

This issue was fixed in the openstack/nova 26.0.0.0rc1 release candidate.
