VM fails to live-migrate during host-lock due to missing cleanup
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
StarlingX | Fix Released | Medium | Jim Gauld |
Bug Description
Brief Description:
During a host-lock, VMs are automatically migrated by nfv-vim. The initial live-migrations succeed, but subsequent live-migrations then fail for various reasons, e.g.:
See the following:
[nova-conductor
This does not resolve itself; manual recovery requires going into the MariaDB neutron database and removing port_id records with vif_type 'unbound' and status 'INACTIVE' for the given source compute host.
See the following:
[nova-compute-
The nova live-migration rollback code will cleanup the source disk, so the next live-migration succeeds.
See the following on the source compute when the NFV VIM is disabling the host (timing dependent):
(i.e., the nova-compute interpreter stops when the pod is shut down; see the AMQP/RabbitMQ messaging drop, MariaDB connection drop, libvirt connection drop, and compute SSH termination.)
[nova-compute-
[nova-conductor
. .
[nova-compute-
. .
[nova-compute-
. .
Severity:
Critical - Cannot reliably do a host-lock without manual recovery.
Steps to Reproduce:
- Launch a VM with volume
- lock the host where the VM is active to cause it to migrate away
Expected Behaviour:
- VMs should live-migrate successfully and automatically, and complete cleanup
- host should become locked
- subsequent host-lock/unlock, and live-migrations should succeed
Actual Behavior:
Post-live-migration cleanup at the compute source host occurs after the nova database instances table gets updated with the destination host and task_state None. This source cleanup period has been observed to take up to 3 seconds and covers the disk, neutron ports, migration record, and console. The NFV VIM detects that the live-migration is complete when it sees the instance has moved to the destination host. This criterion does not account for the post-live-migration cleanup phase at the source. The host-disable occurs prematurely, shutting down pods on the source host, so the disk, neutron ports, migration record, and console are not properly cleaned up.
We see various logs (timing dependent) for compute ssh terminated, AMQP/rabbit messaging drop, database disconnect, libvirt disconnect.
We do not see the final 'Live migration has completed' log at the source compute, though the VM has indeed migrated.
Subsequent live-migrations, or host-locks that trigger migrations, fail: stale neutron ports can use up the available ports and cause scheduling failures, pre-live-migration can fail due to the leftover disk, etc.
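The completion criterion just described can be sketched as a small predicate. This is an illustrative reconstruction, not the actual nfv-vim code; the instance record is modeled as a plain dict:

```python
def looks_migrated(instance: dict, dest_host: str) -> bool:
    """The (insufficient) completion check described above: the nova
    instances record shows the destination host and task_state None.
    It says nothing about post-live-migration cleanup still running
    at the source, which is exactly the gap behind this bug."""
    return instance["host"] == dest_host and instance["task_state"] is None

# The instances row is updated before source-side cleanup finishes, so the
# predicate turns True while disk/port/console cleanup may still be running.
row = {"host": "compute-1", "task_state": None}
print(looks_migrated(row, "compute-1"))  # True, yet cleanup may be in flight
```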
Reproducibility
---------------
100 percent.
System Configuration
--------------------
Multi-mode system.
Branch/Pull Time/Commit
-----------------------
NA
Last Pass
---------
NA
Timestamp/Logs
--------------
NA. Root cause identified.
Test Activity
-------------
Evaluation
Workaround
----------
To clean up the disk on the specific host:
sudo rm -fr /var/lib/
To clean up neutron ports in the MariaDB neutron database:
(i.e., they are 'unbound' on the source compute host with status 'INACTIVE').
// Connect to mariadb mysql shell for neutron database
PODS=( $(kubectl get pods -n openstack --selector=
DBPOD=${PODS[0]}
kubectl exec -it -n openstack ${DBPOD} -c mariadb -- bash -c 'eval env 1>/dev/null; mysql --password=
// Issue the following to delete the unbound port_id:
SELECT * FROM ml2_port_bindings WHERE vif_type = 'unbound' AND status = 'INACTIVE';
DELETE FROM ml2_port_bindings WHERE vif_type = 'unbound' AND status = 'INACTIVE';
quit;
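The SELECT/DELETE steps above can be wrapped in a small helper that assembles the kubectl command line for review before anything is run. This is a hypothetical sketch: the pod name `mariadb-server-0` is a placeholder for the pod discovered above, and the credential handling from the original command is omitted:

```python
# Assemble (but do not run) the kubectl/mysql command for the port cleanup.
SQL_DELETE = ("DELETE FROM ml2_port_bindings "
              "WHERE vif_type = 'unbound' AND status = 'INACTIVE';")

def mariadb_exec_argv(pod, sql, namespace="openstack"):
    """Build the argv that runs `sql` in the pod's mariadb container
    against the neutron database (password handling omitted here)."""
    return ["kubectl", "exec", "-n", namespace, pod, "-c", "mariadb",
            "--", "mysql", "neutron", "-e", sql]

argv = mariadb_exec_argv("mariadb-server-0", SQL_DELETE)
print(" ".join(argv))
```

Printing the argv first lets an operator double-check the statement before executing it against the database.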
Changed in starlingx: | |
assignee: | nobody → Jim Gauld (jgauld) |
Changed in starlingx: | |
status: | New → In Progress |
tags: | added: stx.5.0 stx.distro.openstack |
tags: | added: stx.nfv |
Changed in starlingx: | |
importance: | Undecided → Medium |
Reviewed: https://review.opendev.org/747967
Committed: https://git.openstack.org/cgit/starlingx/nfv/commit/?id=c4429fd67a3e4bc669d5fbc18921eb6be79fc214
Submitter: Zuul
Branch: master
commit c4429fd67a3e4bc669d5fbc18921eb6be79fc214
Author: Jim Gauld <email address hidden>
Date: Tue Aug 25 10:23:24 2020 -0400
Add host stabilization wait for post-live-migration cleanup
This adds a WaitHostStabilizeTaskWork routine to be called after
all VMs have migrated from the host during host-disable.
Post-live-migration cleanup at the compute source host occurs
after the nova database instances table get updated with
destination host and task_state None. This source cleanup period
has been observed to take up to 3 seconds for: disk, neutron
ports, migration record, and console.
There is not a deterministic way provided by nova to indicate
live-migration including the cleanup has completed. This update
gives sufficient wait (i.e., 10 seconds) for post-live-migration
to complete at the source before host is disabled and pods are
shutdown.
Change-Id: Id6d500b627dea8057807bd7dfa07899bd205d3e6
Closes-Bug: 1892885
Signed-off-by: Jim Gauld <email address hidden>
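The fix described in the commit message amounts to a fixed stabilization wait before host-disable proceeds. Below is a minimal, hypothetical sketch of such a task: the class name, methods, and polling structure are illustrative, not the actual StarlingX WaitHostStabilizeTaskWork code, and the 10-second default is taken from the commit message:

```python
import time

STABILIZE_WAIT_SECONDS = 10  # from the fix: cleanup observed to take up to
                             # ~3 s, so 10 s gives comfortable margin

class WaitHostStabilize:
    """Hypothetical stand-in for a 'wait for the source host to
    stabilize' task run after all VMs have migrated off the host."""

    def __init__(self, wait_seconds=STABILIZE_WAIT_SECONDS):
        self.wait_seconds = wait_seconds
        self.start = None

    def run(self):
        # Called once migrations are done, before host-disable proceeds.
        self.start = time.monotonic()

    def is_complete(self):
        # Polled by the task framework; pods are only shut down after
        # this returns True, giving post-live-migration cleanup time.
        return (self.start is not None and
                time.monotonic() - self.start >= self.wait_seconds)

# Demonstration with a short wait so the example runs quickly.
task = WaitHostStabilize(wait_seconds=0.05)
task.run()
print(task.is_complete())  # False: the stabilization window is still open
time.sleep(0.1)
print(task.is_complete())  # True: safe to proceed with host-disable
```

A fixed wait is a pragmatic choice here: as the commit notes, nova exposes no deterministic signal that source-side cleanup has finished, so a bounded delay trades a few seconds of lock time for reliable cleanup.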