After a repeat of an incomplete migration, nova applies the wrong (status=error) migration context in the update_available_resource periodic job

Bug #2051685 reported by Bence Romsics
Affects: OpenStack Compute (nova)
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

The original problem observed in a downstream deployment was overcommit on dedicated PCPUs and a CPUPinningInvalid exception breaking the update_available_resource periodic job.

The following is not an end-to-end reproduction, but I hope it demonstrates where things go wrong.

The environment is a multi-node devstack:
devstack0 - all-in-one
devstack0a - compute

Nova is backed by libvirt/qemu/kvm.

devstack 6b0f055b
nova on devstack0 39f560d673
nova on devstack0a a72f7eaac7
libvirt 8.0.0-1ubuntu7.8
qemu 1:6.2+dfsg-2ubuntu6.16
linux 5.15.0-91-generic

# Clean up if not the first run.
openstack server list -f value -c ID | xargs -r openstack server delete --wait
openstack volume list --status available -f value -c ID | xargs -r openstack volume delete

# Create a server on devstack0.
openstack flavor create cirros256-pinned --public --vcpus 1 --ram 256 --disk 1 --property hw_rng:allowed=True --property hw:cpu_policy=dedicated
openstack server create --flavor cirros256-pinned --image cirros-0.6.2-x86_64-disk --boot-from-volume 1 --nic net-id=private --availability-zone :devstack0 vm0 --wait

# Start a live migration to devstack0a, but simulate a failure. In my environment a complete live migration takes around 20 seconds; with a sleep of only a few seconds it usually breaks while still in the 'preparing' status.
# As far as I understand other kinds of migration (like cold migration) are also affected.
openstack server migrate --live-migration vm0 --wait & sleep 2 ; ssh devstack0a sudo systemctl stop devstack@n-cpu

$ openstack server migration list --server vm0 --sort-column 'Created At'
+----+--------------------------------------+-------------+------------+----------------+--------------+----------------+-----------+--------------------------------------+------------+------------+----------------+----------------------------+----------------------------+
| Id | UUID                                 | Source Node | Dest Node  | Source Compute | Dest Compute | Dest Host      | Status    | Server UUID                          | Old Flavor | New Flavor | Type           | Created At                 | Updated At                 |
+----+--------------------------------------+-------------+------------+----------------+--------------+----------------+-----------+--------------------------------------+------------+------------+----------------+----------------------------+----------------------------+
| 33 | c7a42f9e-dfee-4a2c-b42a-a73b1a19c0c9 | devstack0   | devstack0a | devstack0      | devstack0a   | 192.168.122.79 | preparing | a2b43180-8ad9-4c12-ad47-12b8dd7a7384 | 11         | 11         | live-migration | 2024-01-29T12:41:40.000000 | 2024-01-29T12:41:42.000000 |
+----+--------------------------------------+-------------+------------+----------------+--------------+----------------+-----------+--------------------------------------+------------+------------+----------------+----------------------------+----------------------------+

# After some timeout (around 60 s) the migration goes to 'error' status.
$ openstack server migration list --server vm0 --sort-column 'Created At'
+----+--------------------------------------+-------------+------------+----------------+--------------+----------------+-----------+--------------------------------------+------------+------------+----------------+----------------------------+----------------------------+
| Id | UUID                                 | Source Node | Dest Node  | Source Compute | Dest Compute | Dest Host      | Status    | Server UUID                          | Old Flavor | New Flavor | Type           | Created At                 | Updated At                 |
+----+--------------------------------------+-------------+------------+----------------+--------------+----------------+-----------+--------------------------------------+------------+------------+----------------+----------------------------+----------------------------+
| 33 | c7a42f9e-dfee-4a2c-b42a-a73b1a19c0c9 | devstack0   | devstack0a | devstack0      | devstack0a   | 192.168.122.79 | error     | a2b43180-8ad9-4c12-ad47-12b8dd7a7384 | 11         | 11         | live-migration | 2024-01-29T12:41:40.000000 | 2024-01-29T12:42:42.000000 |
+----+--------------------------------------+-------------+------------+----------------+--------------+----------------+-----------+--------------------------------------+------------+------------+----------------+----------------------------+----------------------------+

# Wait before restarting n-cpu on devstack0a. I don't fully understand what determines whether the migration finally ends up in 'failed' or in 'error' status. Currently it seems that if I restart n-cpu too quickly, the migration goes to 'failed' right after the restart, but for this reproduction it has to stay in 'error'. So far, restarting only after 10+ minutes has always left the migration in 'error' status.
sleep 600 ; ssh devstack0a sudo systemctl start devstack@n-cpu

# The server gets stuck in MIGRATING status.
$ openstack server show vm0 -f value -c status -c compute_host
devstack0
MIGRATING

# We reset it to ACTIVE. While I don't currently have an at-will upstream reproduction that avoids 'set --state', I believe the same end problem was observed in downstream environments without ever resorting to 'set --state'.
$ openstack server set --state active vm0
$ openstack server show vm0 -f value -c status -c compute_host
devstack0
ACTIVE

# Start following n-cpu logs on devstack0a.
sudo journalctl -u devstack@n-cpu -f | egrep --line-buffered 'ERROR|Applying migration context for instance'

# Try to live migrate the same server again, this time letting it finish. It completes successfully. (I have occasionally observed the 2nd migration fail with "nova.exception.DestinationDiskExists: The supplied disk path (/opt/stack/data/nova/instances/19700c16-88d9-416c-9fb8-b79bea7925ee) already exists, it is expected not to exist." My best guess is that I did not give n-cpu on devstack0a enough time to clean up after its restart. Most of the time this error does not happen.)
openstack server migrate --live-migration vm0 --wait

# However, in the logs of n-cpu on devstack0a we see that the update_available_resource periodic job applies the migration context of the previously failed migration, and keeps applying it every minute.
$ sudo journalctl -u devstack@n-cpu -f | egrep --line-buffered 'ERROR|Applying migration context for instance'
jan 29 12:57:37 devstack0a nova-compute[34197]: DEBUG nova.compute.resource_tracker [None req-ec6eeebe-f367-4246-b440-9c34915395cc None None] Applying migration context for instance a2b43180-8ad9-4c12-ad47-12b8dd7a7384 as it has an incoming, in-progress migration c7a42f9e-dfee-4a2c-b42a-a73b1a19c0c9. Migration status is error {{(pid=34197) _update_available_resource /opt/stack/nova/nova/compute/resource_tracker.py:1016}}
jan 29 12:58:38 devstack0a nova-compute[34197]: DEBUG nova.compute.resource_tracker [None req-ec6eeebe-f367-4246-b440-9c34915395cc None None] Applying migration context for instance a2b43180-8ad9-4c12-ad47-12b8dd7a7384 as it has an incoming, in-progress migration c7a42f9e-dfee-4a2c-b42a-a73b1a19c0c9. Migration status is error {{(pid=34197) _update_available_resource /opt/stack/nova/nova/compute/resource_tracker.py:1016}}
jan 29 12:59:38 devstack0a nova-compute[34197]: DEBUG nova.compute.resource_tracker [None req-ec6eeebe-f367-4246-b440-9c34915395cc None None] Applying migration context for instance a2b43180-8ad9-4c12-ad47-12b8dd7a7384 as it has an incoming, in-progress migration c7a42f9e-dfee-4a2c-b42a-a73b1a19c0c9. Migration status is error {{(pid=34197) _update_available_resource /opt/stack/nova/nova/compute/resource_tracker.py:1016}}
...

# For reference we have these migrations at the end.
rubasov@devstack0:~/src/os/openstack/devstack$ openstack server migration list --server vm0 --sort-column 'Created At'
+----+--------------------------------------+-------------+------------+----------------+--------------+----------------+-----------+--------------------------------------+------------+------------+----------------+----------------------------+----------------------------+
| Id | UUID                                 | Source Node | Dest Node  | Source Compute | Dest Compute | Dest Host      | Status    | Server UUID                          | Old Flavor | New Flavor | Type           | Created At                 | Updated At                 |
+----+--------------------------------------+-------------+------------+----------------+--------------+----------------+-----------+--------------------------------------+------------+------------+----------------+----------------------------+----------------------------+
| 33 | c7a42f9e-dfee-4a2c-b42a-a73b1a19c0c9 | devstack0   | devstack0a | devstack0      | devstack0a   | 192.168.122.79 | error     | a2b43180-8ad9-4c12-ad47-12b8dd7a7384 | 11         | 11         | live-migration | 2024-01-29T12:41:40.000000 | 2024-01-29T12:42:42.000000 |
| 34 | 20d7f349-2d5a-4796-bedc-8bc6e06134cf | devstack0   | devstack0a | devstack0      | devstack0a   | 192.168.122.79 | completed | a2b43180-8ad9-4c12-ad47-12b8dd7a7384 | 11         | 11         | live-migration | 2024-01-29T12:56:57.000000 | 2024-01-29T12:57:10.000000 |
+----+--------------------------------------+-------------+------------+----------------+--------------+----------------+-----------+--------------------------------------+------------+------------+----------------+----------------------------+----------------------------+

# The relevant code.
https://opendev.org/openstack/nova/src/commit/6531ed6310c4c11ee807e10d1a0fa91658c3afea/nova/compute/resource_tracker.py#L995-L1024
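To make the failure mode easier to follow, here is a minimal, self-contained Python paraphrase of how I read that block. The dataclasses, the helper name and the hard-coded values below are mine, not nova's, and the real code does much more than log a line; the point is only that nothing on this path filters out a migration whose status is 'error':

from dataclasses import dataclass

@dataclass
class Migration:            # stand-in for nova.objects.Migration
    uuid: str
    instance_uuid: str
    dest_compute: str
    status: str             # 'preparing', 'error', 'completed', ...

@dataclass
class Instance:             # stand-in for nova.objects.Instance
    uuid: str

def apply_incoming_migration_contexts(migrations, instances_by_uuid, host):
    # Roughly what the periodic job does for each migration it considers
    # in-progress on this host/node.
    for migration in migrations:
        instance = instances_by_uuid.get(migration.instance_uuid)
        if instance is None:
            continue
        # The migration is "incoming" when this host is its destination.
        if migration.dest_compute != host:
            continue
        # No check on migration.status here, so the stale migration
        # context is (re)applied on every run.
        print('Applying migration context for instance %s as it has an '
              'incoming, in-progress migration %s. Migration status is %s'
              % (instance.uuid, migration.uuid, migration.status))

stale = Migration('c7a42f9e-dfee-4a2c-b42a-a73b1a19c0c9',
                  'a2b43180-8ad9-4c12-ad47-12b8dd7a7384',
                  'devstack0a', 'error')
vm0 = Instance('a2b43180-8ad9-4c12-ad47-12b8dd7a7384')
# The query feeding the periodic job still returns the old migration even
# though it is in 'error' status, hence the repeating log lines above.
apply_incoming_migration_contexts([stale], {vm0.uuid: vm0}, 'devstack0a')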

I don't yet see how this should be fixed. Maybe a migration should never be stuck indefinitely in status=error; to me status=failed seems to be the proper final state. Maybe we should not apply a status=error migration context when there is a successfully repeated migration of the same server to the same destination host. Or, at least, we should not apply the status=error migration context on every update_available_resource call. Maybe we should also do some cleanup when resetting the server state; however, I believe the problem can appear without ever resetting the state, so that would not be a generic fix. Let me know what you think...
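For the sake of discussion, the second idea above could look roughly like the filter below. This is purely illustrative: the function name is mine, it assumes Migration-like objects with status, dest_compute and created_at fields (as in the sketch above, plus created_at), the per-instance migration list would need its own query, and I have not thought through whether skipping the context here is safe for resource accounting.

def should_apply_migration_context(migration, migrations_of_same_instance):
    # Keep applying contexts of genuinely in-progress migrations.
    if migration.status != 'error':
        return True
    # Skip a stale status=error migration once a newer migration of the
    # same instance to the same destination host has completed.
    return not any(
        m.status == 'completed'
        and m.dest_compute == migration.dest_compute
        and m.created_at > migration.created_at
        for m in migrations_of_same_instance
    )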

Also, please let me know if this is sufficient to demonstrate that there is a bug here. If needed, I can work on extending the reproduction to re-create the actual overcommit problems, but I hope that applying the wrong migration context already shows we have a problem, and that I can be spared that work.
