After repeat of incomplete migration nova applies wrong (status=error) migration context in update_available_resource periodic job
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | New | Undecided | Unassigned |
Bug Description
The original problem observed in a downstream deployment was overcommit on dedicated PCPUs and a CPUPinningInvalid exception breaking the update_available_resource periodic job.
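To illustrate the failure mode described above, here is a toy model (not nova code; all names are illustrative assumptions) of how applying a stale migration context a second time can corrupt pinned-CPU accounting on the destination host, analogous to the CPUPinningInvalid symptom:

```python
# Toy model (NOT nova code) of pinned-CPU accounting. All names here
# are illustrative assumptions, not real nova identifiers.

def apply_pinning(host_pinned: set, instance_pins: set) -> set:
    """Claim pinned CPUs for an instance; overlap means overcommit."""
    overlap = host_pinned & instance_pins
    if overlap:
        # Analogous to nova raising CPUPinningInvalid.
        raise ValueError(f"CPU set {sorted(overlap)} is already pinned")
    return host_pinned | instance_pins

host_pinned = set()
# The repeated, completed migration claims CPUs for the instance.
host_pinned = apply_pinning(host_pinned, {2})
# The periodic job then also applies the stale status=error migration
# context for the same instance, claiming the same CPUs again.
try:
    host_pinned = apply_pinning(host_pinned, {2})
except ValueError as exc:
    print(exc)
```

The second `apply_pinning` call stands in for update_available_resource re-applying the errored migration's context after the instance is already accounted for.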
The following reproduction is not an end-to-end reproduction, but I hope I can demonstrate where things go wrong.
The environment is a multi-node devstack:
devstack0 - all-in-one
devstack0a - compute
Nova is backed by libvirt/qemu/kvm.
devstack 6b0f055b
nova on devstack0 39f560d673
nova on devstack0a a72f7eaac7
libvirt 8.0.0-1ubuntu7.8
qemu 1:6.2+dfsg-
linux 5.15.0-91-generic
# Clean up if not the first run.
openstack server list -f value -c ID | xargs -r openstack server delete --wait
openstack volume list --status available -f value -c ID | xargs -r openstack volume delete
# Create a server on devstack0.
openstack flavor create cirros256-pinned --public --vcpus 1 --ram 256 --disk 1 --property hw_rng:allowed=True --property hw:cpu_
openstack server create --flavor cirros256-pinned --image cirros-
# Start a live migration to devstack0a, but simulate a failure. In my environment a complete live migration takes around 20 seconds. Using 'sleep 3' it usually breaks in the 'preparing' status.
# As far as I understand other kinds of migration (like cold migration) are also affected.
openstack server migrate --live-migration vm0 --wait & sleep 2 ; ssh devstack0a sudo systemctl stop devstack@n-cpu
$ openstack server migration list --server vm0 --sort-column 'Created At'
+----+--------------------------------------+-------------+------------+----------------+--------------+----------------+-----------+--------------------------------------+------------+------------+----------------+----------------------------+----------------------------+
| Id | UUID                                 | Source Node | Dest Node  | Source Compute | Dest Compute | Dest Host      | Status    | Server UUID                          | Old Flavor | New Flavor | Type           | Created At                 | Updated At                 |
+----+--------------------------------------+-------------+------------+----------------+--------------+----------------+-----------+--------------------------------------+------------+------------+----------------+----------------------------+----------------------------+
| 33 | c7a42f9e-dfee-4a2c-b42a-a73b1a19c0c9 | devstack0   | devstack0a | devstack0      | devstack0a   | 192.168.122.79 | preparing | a2b43180-8ad9-4c12-ad47-12b8dd7a7384 | 11         | 11         | live-migration | 2024-01-29T12:41:40.000000 | 2024-01-29T12:41:42.000000 |
+----+--------------------------------------+-------------+------------+----------------+--------------+----------------+-----------+--------------------------------------+------------+------------+----------------+----------------------------+----------------------------+
# After some timeout (around 60 s) the migration goes to 'error' status.
$ openstack server migration list --server vm0 --sort-column 'Created At'
+----+--------------------------------------+-------------+------------+----------------+--------------+----------------+-----------+--------------------------------------+------------+------------+----------------+----------------------------+----------------------------+
| Id | UUID                                 | Source Node | Dest Node  | Source Compute | Dest Compute | Dest Host      | Status    | Server UUID                          | Old Flavor | New Flavor | Type           | Created At                 | Updated At                 |
+----+--------------------------------------+-------------+------------+----------------+--------------+----------------+-----------+--------------------------------------+------------+------------+----------------+----------------------------+----------------------------+
| 33 | c7a42f9e-dfee-4a2c-b42a-a73b1a19c0c9 | devstack0   | devstack0a | devstack0      | devstack0a   | 192.168.122.79 | error     | a2b43180-8ad9-4c12-ad47-12b8dd7a7384 | 11         | 11         | live-migration | 2024-01-29T12:41:40.000000 | 2024-01-29T12:42:42.000000 |
+----+--------------------------------------+-------------+------------+----------------+--------------+----------------+-----------+--------------------------------------+------------+------------+----------------+----------------------------+----------------------------+
# Wait before restarting n-cpu on devstack0a. I don't fully understand what determines whether the migration finally ends up in 'failed' or in 'error' status. Currently it seems that if I restart n-cpu too quickly, the migration goes to 'failed' right after the restart, but for this reproduction it has to stay in 'error'. So far, restarting only after 10+ minutes has always left the migration in 'error' status.
sleep 600 ; ssh devstack0a sudo systemctl start devstack@n-cpu
# The server gets stuck in MIGRATING status.
$ openstack server show vm0 -f value -c status -c compute_host
devstack0
MIGRATING
# We reset it to ACTIVE. While I don't have an at-will upstream reproduction at the moment without 'set --state', I believe the same end problem was observed in downstream environments without ever resorting to 'set --state'.
$ openstack server set --state active vm0
$ openstack server show vm0 -f value -c status -c compute_host
devstack0
ACTIVE
# Start following n-cpu logs on devstack0a.
sudo journalctl -u devstack@n-cpu -f | egrep --line-buffered 'ERROR|Applying migration context for instance'
# Try to live migrate the same server again, this time letting it finish. It completes successfully. (I have occasionally observed the 2nd migration fail with a "nova.exception..." error.)
openstack server migrate --live-migration vm0 --wait
# However in the logs of n-cpu on devstack0a we see that update_available_resource keeps applying the migration context of the errored migration:
$ sudo journalctl -u devstack@n-cpu -f | egrep --line-buffered 'ERROR|Applying migration context for instance'
jan 29 12:57:37 devstack0a nova-compute[
jan 29 12:58:38 devstack0a nova-compute[
jan 29 12:59:38 devstack0a nova-compute[
...
# For reference we have these migrations at the end.
$ openstack server migration list --server vm0 --sort-column 'Created At'
+----+--------------------------------------+-------------+------------+----------------+--------------+----------------+-----------+--------------------------------------+------------+------------+----------------+----------------------------+----------------------------+
| Id | UUID                                 | Source Node | Dest Node  | Source Compute | Dest Compute | Dest Host      | Status    | Server UUID                          | Old Flavor | New Flavor | Type           | Created At                 | Updated At                 |
+----+--------------------------------------+-------------+------------+----------------+--------------+----------------+-----------+--------------------------------------+------------+------------+----------------+----------------------------+----------------------------+
| 33 | c7a42f9e-dfee-4a2c-b42a-a73b1a19c0c9 | devstack0   | devstack0a | devstack0      | devstack0a   | 192.168.122.79 | error     | a2b43180-8ad9-4c12-ad47-12b8dd7a7384 | 11         | 11         | live-migration | 2024-01-29T12:41:40.000000 | 2024-01-29T12:42:42.000000 |
| 34 | 20d7f349-2d5a-4796-bedc-8bc6e06134cf | devstack0   | devstack0a | devstack0      | devstack0a   | 192.168.122.79 | completed | a2b43180-8ad9-4c12-ad47-12b8dd7a7384 | 11         | 11         | live-migration | 2024-01-29T12:56:57.000000 | 2024-01-29T12:57:10.000000 |
+----+--------------------------------------+-------------+------------+----------------+--------------+----------------+-----------+--------------------------------------+------------+------------+----------------+----------------------------+----------------------------+
# The relevant code.
https:/
I don't yet see how this should be fixed. Maybe a migration should never be stuck indefinitely in status=error; to me status=failed seems to be the proper final state. Maybe we should not apply a status=error migration context when there is a successfully repeated migration of the same server to the same dst host. Or at least we should not apply the status=error migration context in every update_available_resource run.
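The second idea above could be sketched roughly as follows. This is a minimal standalone sketch, not actual nova code: the `Migration` class and `pick_context_migration` function are hypothetical names invented for illustration, and it only models the selection logic, not the real resource tracker.

```python
# Hypothetical sketch (NOT nova code): among an instance's migrations
# targeting this host, ignore ones already in a terminal error state
# and apply the context of the newest remaining migration, if any.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Migration:
    id: int
    status: str
    created_at: str  # ISO timestamps compare correctly as strings

def pick_context_migration(migrations) -> Optional[Migration]:
    """Return the migration whose context should be applied, or None."""
    live = [m for m in migrations if m.status not in ("error", "failed")]
    return max(live, key=lambda m: m.created_at, default=None)

# The two migrations from the reproduction above:
migrations = [
    Migration(33, "error", "2024-01-29T12:41:40"),
    Migration(34, "completed", "2024-01-29T12:56:57"),
]
chosen = pick_context_migration(migrations)
print(chosen.id)  # the repeated, completed migration is chosen, not the errored one
```

With only the errored migration 33 present, the sketch would apply no context at all, which is the behavior the periodic job arguably should have here.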
Also please let me know whether this is sufficient to demonstrate there is a bug here. If needed, I can work on extending the reproduction to re-create the actual overcommit problems, but I hope applying the wrong migration context already shows we have a problem, which would spare me that work.