VM fails to live-migrate during host-lock due to missing cleanup

Bug #1892885 reported by Jim Gauld
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Jim Gauld

Bug Description

Brief Description:

During host-lock, VMs are automatically migrated by nfv-vim. The initial live-migrations succeed, but subsequent live-migrations then fail for various reasons, e.g.:

See the following:
[nova-conductor-5b9cf9d5dc-cx8tt nova-conductor] 2020-08-13 17:44:42.653 1 ERROR nova.network.neutronv2.api [req-041f910e-f013-4a8e-9a40-371ed29c67ad - - - - -] [instance: 31f095fd-3c8f-488e-ba61-4e0394342cfc] Binding failed for port d33e1366-e995-467d-90da-a675041c2cff and host controller-1. Error: (409 {"NeutronError": {"message": "Binding for port d33e1366-e995-467d-90da-a675041c2cff on host controller-1 already exists.", "type": "PortBindingAlreadyExists", "detail": ""}})

This does not resolve itself; the port_id records with vif_type 'unbound' and status 'INACTIVE' for the given source compute host must be removed manually from the MariaDB neutron database.
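
A quick read-only check for these stale records, reusing the MariaDB pod lookup from the Workaround section further below (a hedged sketch; it only selects rows, it does not delete anything):

// List stale port bindings left 'unbound'/'INACTIVE' by the interrupted cleanup
PODS=( $(kubectl get pods -n openstack --selector=application=mariadb,component=server --field-selector status.phase=Running --output=jsonpath={.items..metadata.name}) )
DBPOD=${PODS[0]}
kubectl exec -n openstack ${DBPOD} -c mariadb -- bash -c 'mysql --user=$MYSQL_DBADMIN_USERNAME --password=$MYSQL_DBADMIN_PASSWORD neutron -e "SELECT port_id, host, vif_type, status FROM ml2_port_bindings WHERE vif_type = \"unbound\" AND status = \"INACTIVE\";"'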

See the following:
[nova-compute-controller-0-937646f6-5qznb nova-compute] 2020-08-13 18:18:10.801 993998 ERROR nova.compute.manager [-] [instance: 31f095fd-3c8f-488e-ba61-4e0394342cfc] Pre live migration failed at controller-1: DestinationDiskExists_Remote: The supplied disk path (/var/lib/nova/instances/31f095fd-3c8f-488e-ba61-4e0394342cfc) already exists, it is expected not to exist.

The nova live-migration rollback code will clean up the source disk, so the next live-migration succeeds.
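
A quick way to confirm the leftover disk (a hedged sketch; run on the compute host named in the error, using the instance UUID from the logs above as the example, with admin openstack CLI credentials):

// If the directory still exists while the instance is hosted elsewhere, it is the stale disk left behind by the interrupted cleanup
ls -ld /var/lib/nova/instances/31f095fd-3c8f-488e-ba61-4e0394342cfc
openstack server show 31f095fd-3c8f-488e-ba61-4e0394342cfc -c OS-EXT-SRV-ATTR:host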

See the following on the source compute when the NFV VIM is disabling the host (timing dependent):
(i.e., the nova-compute interpreter stops when the pod is shut down; note the AMQP/RabbitMQ messaging drop, MariaDB connection drop, libvirt connection drop, and compute ssh termination.)

[nova-compute-controller-1-cab72f56-dzmk4 nova-compute] 2020-08-18 18:54:54.754 539096 WARNING amqp [-] Received method (60, 30) during closing channel 1. This method will be ignored
[nova-conductor-76d4fd5676-vbjn4 nova-conductor] 2020-08-18 18:54:55.820 1 WARNING amqp [-] Received method (60, 30) during closing channel 1. This method will be ignored
. .
[nova-compute-controller-1-cab72f56-bqdqp nova-compute-ssh] Received signal 15; terminating.
. .
[nova-compute-controller-0-937646f6-6lr9f nova-compute] 2020-08-18 18:54:46.353 833894 INFO nova.compute.manager [req-7e674808-dfe0-47b1-ae9e-fcb788ffbbc5 c22ce9a46bee4cae87f6222cd5799496 b8da944ac47c4eeea5d4439f7b65cd33 - default default] [instance: 3d7e983a-d95c-4f20-aaac-968c8b0f5877] Post operation of migration started
. .

Severity:
Critical - Cannot reliably do a host-lock without manual recovery.

Steps to Reproduce:
- Launch a VM booted from a volume
- Lock the host where the VM is active to cause it to migrate away (example commands below)
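
For example (a hedged sketch; the flavor, image, network, volume, server, and host names are illustrative placeholders):

// Boot a VM from a volume
openstack volume create --image <image-name> --size 10 vm1-vol
openstack server create --flavor <flavor-name> --network <network-name> --volume vm1-vol vm1
// Find the host the VM is running on, then lock it to trigger the automatic live-migration
openstack server show vm1 -c OS-EXT-SRV-ATTR:host
system host-lock <host-name>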

Expected Behaviour:
- VMs should live-migrate automatically and successfully, with cleanup completed
- The host should become locked
- Subsequent host-lock/unlock operations and live-migrations should succeed

Actual Behavior:
Post-live-migration cleanup at the source compute host occurs after the nova database instances table is updated with the destination host and task_state None. This source cleanup period has been observed to take up to 3 seconds and covers the disk, neutron ports, migration record, and console. The NFV VIM considers the live-migration complete when it sees the instance has moved to the destination; this criterion does not account for the post-live-migration cleanup phase at the source. The host-disable therefore occurs prematurely and shuts down the pods on the source host, so the disk, neutron ports, migration record, and console are not properly cleaned up.
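
For illustration, the completion criterion the VIM relies on is roughly equivalent to the following CLI check (a hedged sketch, not the VIM's actual code path; <instance-uuid> is a placeholder); it already reports the instance as moved while the source-side cleanup is still running:

// The instance shows the destination host and task_state None a few seconds before the source finishes cleaning up disk, ports, migration record, and console
openstack server show <instance-uuid> -c OS-EXT-SRV-ATTR:host -c OS-EXT-STS:task_state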

We see various logs (timing dependent) for compute ssh termination, AMQP/RabbitMQ messaging drop, database disconnect, and libvirt disconnect.

We do not see the final live-migration-completed log at the source compute, though the VM has indeed migrated.

Subsequent live-migrations, or host-locks that trigger migrations, fail for various reasons: stale neutron port bindings cause scheduling failures, pre-live-migration fails because the instance disk already exists at the destination, etc.

Reproducibility
---------------
100 percent.

System Configuration
--------------------
Multi-node system.

Branch/Pull Time/Commit
-----------------------
NA

Last Pass
---------
NA

Timestamp/Logs
--------------
NA. Root cause identified.

Test Activity
-------------
Evaluation

Workaround
----------
To clean up the leftover instance disk on the affected compute host:
sudo rm -fr /var/lib/nova/instances/<instance_uuid>
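
Before removing the directory, confirm the instance is no longer hosted on that node (a hedged sanity check; <instance_uuid> is the same placeholder as above, and admin openstack CLI credentials are assumed):

openstack server show <instance_uuid> -c OS-EXT-SRV-ATTR:host -c status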

To clean up neutron ports in the MariaDB neutron database (i.e., those that are 'unbound' on the source compute host with status 'INACTIVE'):
// Connect to mariadb mysql shell for neutron database
PODS=( $(kubectl get pods -n openstack --selector=application=mariadb,component=server --field-selector status.phase=Running --output=jsonpath={.items..metadata.name}) )
DBPOD=${PODS[0]}
kubectl exec -it -n openstack ${DBPOD} -c mariadb -- bash -c 'eval env 1>/dev/null; mysql --password=$MYSQL_DBADMIN_PASSWORD --user=$MYSQL_DBADMIN_USERNAME neutron'

// Issue the following to delete the unbound port_id:
SELECT * FROM ml2_port_bindings WHERE vif_type = 'unbound' AND status = 'INACTIVE';
DELETE FROM ml2_port_bindings WHERE vif_type = 'unbound' AND status = 'INACTIVE';
quit;
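
After the DELETE, re-running the SELECT above should return an empty set; subsequent live-migrations and host-locks should then proceed without the port-binding and scheduling failures described above.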

Jim Gauld (jgauld)
Changed in starlingx:
assignee: nobody → Jim Gauld (jgauld)
Changed in starlingx:
status: New → In Progress
Ghada Khalil (gkhalil)
tags: added: stx.5.0 stx.distro.openstack
tags: added: stx.nfv
Changed in starlingx:
importance: Undecided → Medium
OpenStack Infra (hudson-openstack) wrote : Fix merged to nfv (master)

Reviewed: https://review.opendev.org/747967
Committed: https://git.openstack.org/cgit/starlingx/nfv/commit/?id=c4429fd67a3e4bc669d5fbc18921eb6be79fc214
Submitter: Zuul
Branch: master

commit c4429fd67a3e4bc669d5fbc18921eb6be79fc214
Author: Jim Gauld <email address hidden>
Date: Tue Aug 25 10:23:24 2020 -0400

    Add host stabilization wait for post-live-migration cleanup

    This adds WaitHostStabilizeTaskWork routine to be called after
    all VMs have migrated from the host during host-disable.

    Post-live-migration cleanup at the compute source host occurs
    after the nova database instances table get updated with
    destination host and task_state None. This source cleanup period
    has been observed to take up to 3 seconds for: disk, neutron
    ports, migration record, and console.

    There is not a deterministic way provided by nova to indicate
    live-migration including the cleanup has completed. This update
    gives sufficient wait (i.e., 10 seconds) for post-live-migration
    to complete at the source before host is disabled and pods are
    shutdown.

    Change-Id: Id6d500b627dea8057807bd7dfa07899bd205d3e6
    Closes-Bug: 1892885
    Signed-off-by: Jim Gauld <email address hidden>
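
Operationally, the effect of the fix can be illustrated with a small wait loop (a hedged sketch only; the actual change is the WaitHostStabilizeTaskWork step inside nfv-vim, not a script, and <instance-uuid> is a placeholder):

// Hypothetical check against the source host after the VM has moved: give post-live-migration cleanup a grace period (the fix allows 10 seconds) before the host is disabled
UUID=<instance-uuid>
for i in $(seq 1 10); do [ ! -d /var/lib/nova/instances/${UUID} ] && break; sleep 1; done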

Changed in starlingx:
status: In Progress → Fix Released