STX-Openstack: "nova live-migration" fails to live migrate after host is forcefully turned off/on

Bug #2007303 reported by Lucas de Ataides Barreto
This bug affects 2 people
Affects     Status         Importance   Assigned to   Milestone
StarlingX   Fix Released   Low          Unassigned

Bug Description

Brief Description
-----------------
After a host is forcefully turned off / on, the "nova live-migration" command fails when that host is the migration destination.

Severity
--------
Minor: System/Feature is usable with minor issue

Steps to Reproduce
------------------
 1. Create VMs (4 in this test) on the standby controller host, using any image, flavor and volume
 2. Turn off the VMs' host (ungracefully) so that they're evacuated
 3. After the host is turned on again and is up on the hypervisor list, live migrate one of the VMs to the host that was turned off / on

Expected Behavior
------------------
VM is migrated to the controller (the one that was turned off / on)

Actual Behavior
----------------
VM fails to migrate

Reproducibility
---------------
Intermittent - Happened 2/5 times

System Configuration
--------------------
Bare metal AIO-DX

Branch/Pull Time/Commit
-----------------------
https://mirror.starlingx.cengn.ca/mirror/starlingx/master/debian/monolithic/20230210T070000Z

Last Pass
---------
https://lists.starlingx.io/pipermail/starlingx-discuss/2023-February/013825.html

Timestamp/Logs
--------------
[sysadmin@controller-1 ~(keystone_admin)]$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | controller-1 | controller | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+
[sysadmin@controller-1 ~(keystone_admin)]$ openstack hypervisor list
+----+---------------------+-----------------+---------------+-------+
| ID | Hypervisor Hostname | Hypervisor Type | Host IP | State |
+----+---------------------+-----------------+---------------+-------+
| 2 | controller-0 | QEMU | 192.168.206.2 | up |
| 4 | controller-1 | QEMU | 192.168.206.3 | up |
+----+---------------------+-----------------+---------------+-------+
[sysadmin@controller-1 ~(keystone_admin)]$ openstack server show 6a088384-da5a-406a-bac6-937efd458043
+-------------------------------------+----------------------------------------------------------+
| Field | Value |
+-------------------------------------+----------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | nova |
| OS-EXT-SRV-ATTR:host | controller-1 |
| OS-EXT-SRV-ATTR:hypervisor_hostname | controller-1 |
| OS-EXT-SRV-ATTR:instance_name | instance-00000012 |
| OS-EXT-STS:power_state | Running |
| OS-EXT-STS:task_state | None |
| OS-EXT-STS:vm_state | active |
| OS-SRV-USG:launched_at | 2023-02-09T14:18:06.000000 |
| OS-SRV-USG:terminated_at | None |
| accessIPv4 | |
| accessIPv6 | |
| addresses | |
| config_drive | |
| created | 2023-02-09T14:08:22Z |
| flavor | flv_nolocaldisk (742cb13f-4e0c-4b74-95b7-d4d435a56121) |
| hostId | 666dce8933de5d1d098496b797eb447cee8729985a44a6e856404766 |
| id | 6a088384-da5a-406a-bac6-937efd458043 |
| image | N/A (booted from volume) |
| key_name | keypair-tenant1 |
| name | tenant1-vol_nolocal-8 |
| progress | 0 |
| project_id | 11ce72602daf4547866b690f517d88df |
| properties | |
| security_groups | name='default' |
| | name='default' |
| status | ACTIVE |
| updated | 2023-02-09T14:28:16Z |
| user_id | f9e688ff61204c6697b7935396faad2b |
| volumes_attached | id='211283be-f5ce-4399-8119-75a79eafc54c' |
+-------------------------------------+----------------------------------------------------------+
[sysadmin@controller-1 ~(keystone_admin)]$ nova live-migration 6a088384-da5a-406a-bac6-937efd458043 controller-0
[sysadmin@controller-1 ~(keystone_admin)]$ nova migration-list
+----+--------------------------------------+--------------+--------------+----------------+--------------+---------------+-----------+--------------------------------------+------------+------------+----------------------------+----------------------------+----------------+----------------------------------+----------------------------------+
| Id | UUID | Source Node | Dest Node | Source Compute | Dest Compute | Dest Host | Status | Instance UUID | Old Flavor | New Flavor | Created At | Updated At | Type | Project ID | User ID |
+----+--------------------------------------+--------------+--------------+----------------+--------------+---------------+-----------+--------------------------------------+------------+------------+----------------------------+----------------------------+----------------+----------------------------------+----------------------------------+
| 8 | 8406fb4a-a79a-4864-a8f8-24a6e694ae63 | - | - | controller-1 | - | - | error | 6a088384-da5a-406a-bac6-937efd458043 | 28 | 28 | 2023-02-09T14:33:19.000000 | 2023-02-09T14:33:22.000000 | live-migration | 14af550bd79d4164a3a286ca34c7ea0b | 715b70b9d5254a6696ab2c2e83724c36 |
| 6 | 0e5df0a4-4c28-4ba8-8f5e-50de516665ae | controller-0 | controller-1 | controller-0 | controller-1 | 192.168.206.3 | completed | aa6e291c-c62e-4c93-8eaa-657d98900ce0 | None | None | 2023-02-09T14:17:23.000000 | 2023-02-09T14:28:17.000000 | evacuation | 14af550bd79d4164a3a286ca34c7ea0b | 3dc21bdae8254214b97b459b0148d90f |
| 5 | a79d7adb-a021-4767-9d68-23b2eeecb00a | controller-0 | controller-1 | controller-0 | controller-1 | 192.168.206.3 | completed | 6a088384-da5a-406a-bac6-937efd458043 | None | None | 2023-02-09T14:17:23.000000 | 2023-02-09T14:28:17.000000 | evacuation | 14af550bd79d4164a3a286ca34c7ea0b | 3dc21bdae8254214b97b459b0148d90f |
| 4 | f2147581-04e7-4102-a9ef-98c6bd2c7bb3 | controller-0 | controller-1 | controller-0 | controller-1 | 192.168.206.3 | completed | 59df8780-098b-4558-a44e-148dc0832b53 | None | None | 2023-02-09T14:17:23.000000 | 2023-02-09T14:28:17.000000 | evacuation | 14af550bd79d4164a3a286ca34c7ea0b | 3dc21bdae8254214b97b459b0148d90f |
| 3 | 883e7e52-1c30-42e0-98ec-1d7849b5d1e8 | controller-0 | controller-1 | controller-0 | controller-1 | 192.168.206.3 | completed | 4b8dde15-dcc1-4f13-9d28-728e2b3c9c08 | None | None | 2023-02-09T14:17:23.000000 | 2023-02-09T14:28:16.000000 | evacuation | 14af550bd79d4164a3a286ca34c7ea0b | 3dc21bdae8254214b97b459b0148d90f |
+----+--------------------------------------+--------------+--------------+----------------+--------------+---------------+-----------+--------------------------------------+------------+------------+----------------------------+----------------------------+----------------+----------------------------------+----------------------------------+
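
The failed attempt is the row with Status "error" in the listing above. As a rough sketch (not part of the original report), rows like that can be picked out of the table text with a few lines of Python. The function names are hypothetical, and the sample is a trimmed, four-column version of the output above:

```python
def parse_cli_table(text):
    """Parse a '+---+' / '| a | b |' style CLI table into a list of dicts."""
    rows = []
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip borders, shell prompts and blank lines
        # Drop the empty strings produced by the leading and trailing '|'.
        cells = [c.strip() for c in line.split("|")[1:-1]]
        if header is None:
            header = cells  # the first '|' line holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def failed_migrations(text):
    """Return the rows whose Status column is 'error'."""
    return [r for r in parse_cli_table(text) if r.get("Status") == "error"]


# Trimmed sample based on the "nova migration-list" output above.
sample = """\
+----+--------------+-----------+----------------+
| Id | Source Node | Status | Type |
+----+--------------+-----------+----------------+
| 8 | - | error | live-migration |
| 6 | controller-0 | completed | evacuation |
+----+--------------+-----------+----------------+
"""

for row in failed_migrations(sample):
    print(row["Id"], row["Type"])
```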


Test Activity
-------------
Sanity

Workaround
----------
Not confirmed: Wait for a few minutes and then try the live migration again
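
A minimal sketch of that wait-and-retry idea, for illustration only. The workaround itself is unconfirmed, and the attempt count and wait interval are assumptions. `migrate` is a hypothetical callable supplied by the caller that triggers the live migration and returns True once it has completed; the demo uses a fake one so the snippet runs without a cloud:

```python
import time


def retry_live_migration(migrate, attempts=3, wait_seconds=120, sleep=time.sleep):
    """Call migrate() until it returns True or the attempts run out.

    Returns the number of the attempt that succeeded, or None if all failed.
    """
    for attempt in range(1, attempts + 1):
        if migrate():
            return attempt
        if attempt < attempts:
            sleep(wait_seconds)  # give the rebooted host time to settle
    return None


# Demo with a fake migration that fails twice and then succeeds.
outcomes = iter([False, False, True])
waits = []  # records the requested waits instead of actually sleeping
result = retry_live_migration(lambda: next(outcomes), sleep=waits.append)
print(result, waits)
```

Passing `sleep` as a parameter keeps the helper testable; in real use the default `time.sleep` applies.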

tags: added: stx.9.0
OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-armada-app (master)
Changed in starlingx:
status: New → In Progress
Thales Elero Cervi (tcervi) wrote :

Although this issue was first reported long ago when stx-openstack was still running OpenStack@Ussuri services, it is again reproducible with OpenStack@2023.1 due to:
  * https://bugs.launchpad.net/nova/+bug/2023035
  * https://bugs.launchpad.net/nova/+bug/2039803

I've confirmed that our new live-migration issues are exactly the reported Nova libvirt driver issue:

sysadmin@compute-1:~$ sudo virsh cpu-compare visrh-caps-compute0.xml
CPU described in visrh-caps-compute0.xml is identical to host CPU

sysadmin@compute-1:~$ sudo virsh hypervisor-cpu-compare visrh-caps-compute0.xml
CPU described in visrh-caps-compute0.xml is incompatible with the CPU provided by hypervisor on the host

sysadmin@compute-1:~$ sudo virsh hypervisor-cpu-compare visrh-caps-compute0.xml --error
error: Failed to compare hypervisor CPU with visrh-caps-compute0.xml
error: the CPU is incompatible with host CPU: Host CPU does not provide required features: ds, acpi, ht, tm, pbe, dtes64, monitor, ds_cpl, smx, est, tm2, xtpr, dca, osxsave

sysadmin@compute-1:~$ virsh domcapabilities > virsh-domcaps-compute1.xml

sysadmin@compute-1:~$ sudo virsh hypervisor-cpu-compare virsh-domcaps-compute1.xml
CPU described in virsh-domcaps-compute1.xml is incompatible with the CPU provided by hypervisor on the host

sysadmin@compute-1:~$ sudo virsh cpu-compare virsh-domcaps-compute1.xml
CPU described in virsh-domcaps-compute1.xml is incompatible with host CPU

So a fix was proposed to start using the nova.conf [workarounds] "skip_cpu_compare_on_dest" option, set to True by default.
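
For illustration, a small sketch of checking whether that option is enabled in a nova.conf-style file. It assumes the file can be read as plain INI text and uses Python's standard `configparser` as a stand-in; Nova itself loads its configuration through oslo.config. The function name and the sample text are hypothetical:

```python
import configparser


def skip_cpu_compare_enabled(conf_text):
    """Return True if [workarounds]/skip_cpu_compare_on_dest is set to true."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    # fallback=False covers a missing section or a missing option.
    return parser.getboolean(
        "workarounds", "skip_cpu_compare_on_dest", fallback=False
    )


# Hypothetical nova.conf fragment with the workaround switched on.
sample_conf = """\
[workarounds]
skip_cpu_compare_on_dest = True
"""

print(skip_cpu_compare_enabled(sample_conf))
```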

OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-armada-app (master)

Reviewed: https://review.opendev.org/c/starlingx/openstack-armada-app/+/900791
Committed: https://opendev.org/starlingx/openstack-armada-app/commit/c1336173394040d168963cae26475036c107fe19
Submitter: "Zuul (22348)"
Branch: master

commit c1336173394040d168963cae26475036c107fe19
Author: Thales Elero Cervi <email address hidden>
Date: Mon Nov 13 09:25:40 2023 -0300

    Skip destination CPU check during live-migration

    OpenStack 2023.1 (Antelope) introduced this issue [1][2] when
    live-migrating instances, in which Nova uses getCapabilities() to
    determine the host CPU model but uses the model from the
    domCapabilities for the guest VM using host-model [3].
    According to the libvirt maintainers nova should never use
    getCapabilities for anything any more.
    A solution for this issue is being developed in main branch [4], but
    taken as a medium priority since there is a workaround config option
    already available [5] to avoid this situation.

    Setting "skip_cpu_compare_on_dest" to True will, during live
    migration, skip comparing guest CPU with the destination host.
    When using QEMU >= 2.9 and libvirt >= 4.4.0, libvirt will do the
    correct thing with respect to checking CPU compatibility on the
    destination host during live migration.
    StarlingX currently delivers QEMU 5.2 and stx-openstack uses a libvirt
    on 7.0.0-3, therefore this can be safely used as part of our default
    nova configuration updates.

    [1] https://bugs.launchpad.net/nova/+bug/2023035
    [2] https://bugs.launchpad.net/nova/+bug/2039803
    [3] https://review.opendev.org/q/topic:fix_compareCPU_usage
    [4] https://review.opendev.org/c/openstack/nova/+/899185
    [5] https://docs.openstack.org/nova/latest/configuration/config.html#skip_cpu_compare_on_dest

    Closes-bug: 2007303

    TEST PLAN:
    PASS - Build python3-k8sapp-openstack plugins
    PASS - Build stx-openstack application
    PASS - Upload/Apply/Remove stx-openstack
    PASS - Live-migrate an instance
    PASS - Manually reboot a compute node in which
           an instance is running to ensure it is
           correctly evacuated to another compute

    Change-Id: Id7a93445af4115ee81b035b4f9dc7a6eb889555b
    Signed-off-by: Thales Elero Cervi <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Low