Online data migrations fail to execute correctly when upgrading from 2023.1

Bug #2065403 reported by Andrew Bonney
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Triaged
Importance: Medium
Assigned to: Unassigned
Milestone: (none)

Bug Description

Description
===========
When executing online_data_migrations during an upgrade from 2023.1 to 2023.2 or beyond, the 'populate_instance_compute_id' method misses numerous relevant migrations when run against a moderately sized deployment.

By default, nova-manage runs the migration with a limit (max-count) of 50 records per batch. Numerous records are unsuitable for migration (where there is no existing node ID), and these stay in the result set on every pass. Eventually the first 50 records returned by the database query are all irrelevant, the batch migrates nothing, and the migration exits as if it had completed, even though many relevant records remain to be migrated.

A secondary issue is that this query approach causes the migration method to be executed hundreds or thousands of times more often than necessary, because every iteration has to skip the same irrelevant records. This is slow: in the deployment I have just upgraded, the migration ran for upwards of 15 minutes before exiting.
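
The early exit can be reproduced with a small model of the batching loop. This is a simplified sketch, not nova's code: Row, run_batches, make_rows and left_behind are hypothetical names, and it assumes that rows come back in a stable order and that the loop stops at the first batch which migrates nothing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Row:
    """Stand-in for an instances row (hypothetical, not nova's Instance)."""
    id: int
    node: Optional[str]
    compute_id: Optional[int] = None


def run_batches(rows, max_count=50, filter_in_query=False):
    """Run fixed-size batches until one batch migrates nothing.

    Returns (total_found, total_migrated), loosely mirroring the
    'Total Needed' and 'Completed' columns printed by nova-manage.
    """
    total_found = 0
    total_migrated = 0
    while True:
        # "Query": rows still lacking a compute_id, in id order.
        pending = [r for r in rows if r.compute_id is None]
        if filter_in_query:
            # Proposed variant: exclude rows that can never be migrated.
            pending = [r for r in pending if r.node is not None]
        batch = pending[:max_count]

        migrated = 0
        for row in batch:
            if row.node is None:
                continue  # skipped now, and again on every later batch
            row.compute_id = 1  # placeholder value for the sketch
            migrated += 1

        total_found += len(batch)
        total_migrated += migrated
        if migrated == 0:
            return total_found, total_migrated


def make_rows():
    # 300 rows; even ids below 120 have no node (60 unmigratable rows).
    return [Row(i, None if (i < 120 and i % 2 == 0) else "node-a")
            for i in range(300)]


def left_behind(rows):
    """Count rows that could have been migrated but were not."""
    return sum(1 for r in rows if r.compute_id is None and r.node is not None)


unfiltered = make_rows()
found_u, migrated_u = run_batches(unfiltered)

filtered = make_rows()
found_f, migrated_f = run_batches(filtered, filter_in_query=True)

print("unfiltered:", found_u, migrated_u, left_behind(unfiltered))
print("filtered:  ", found_f, migrated_f, left_behind(filtered))
```

With this toy data the unfiltered run stops with 191 migratable rows left behind and reports 350 rows found for a 300-row table, while the filtered variant migrates all 240 migratable rows.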

My suspicion is that the query in https://opendev.org/openstack/nova/blame/commit/7096423b343ffce9622fd078fc2b3a87fd3386f7/nova/objects/instance.py#L1359 should really be filtering out records without a 'node' entry to avoid the need for internal exception handling, but there may be some other reason this wasn't done initially.
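
As a rough illustration of that suggestion, the sketch below compares an unfiltered and a filtered batch query on an in-memory SQLite table. It is not nova's SQLAlchemy code: the table only borrows the host, node and compute_id column names from the queries in this report, and the ORDER BY id clause and the test data are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE instances ("
    "id INTEGER PRIMARY KEY, host TEXT, node TEXT, compute_id INTEGER)"
)

# Assumed test data: ids 1-60 have no node, ids 61-200 do.
rows = [(i, "host-a", None if i <= 60 else "node-a", None)
        for i in range(1, 201)]
conn.executemany("INSERT INTO instances VALUES (?, ?, ?, ?)", rows)

# Unfiltered batch: unmigratable rows can fill the whole LIMIT.
UNFILTERED = (
    "SELECT id, node FROM instances "
    "WHERE compute_id IS NULL ORDER BY id LIMIT ?"
)
# Filtered batch: rows without a node never reach the migration code.
FILTERED = (
    "SELECT id, node FROM instances "
    "WHERE compute_id IS NULL AND node IS NOT NULL ORDER BY id LIMIT ?"
)

unfiltered_batch = conn.execute(UNFILTERED, (50,)).fetchall()
filtered_batch = conn.execute(FILTERED, (50,)).fetchall()

usable_unfiltered = [r for r in unfiltered_batch if r[1] is not None]
print(len(unfiltered_batch), "rows fetched,", len(usable_unfiltered), "usable")
print(len(filtered_batch), "rows fetched, all usable")
```

In this setup the unfiltered batch contains no usable rows, which is the state that makes the migration stop early, while the filtered batch is filled entirely with rows that can be migrated.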

Steps to reproduce
==================
Using a moderately sized database (a few tens of thousands of records):
* Perform an upgrade from 2023.1 to 2023.2
* Execute 'nova-manage db online_data_migrations'

Expected result
===============
Migrations complete in a reasonable time, with all relevant records migrated.

Actual result
=============
nova-manage exits after a long time with an apparent success, but in reality many records remain unmigrated.

In the two deployments we have migrated to date, we are left with the following apparently relevant records which should have been migrated but haven't been:

MariaDB [nova]> select count(*) from instances where compute_id is null and node is not null and host is not null;
+----------+
| count(*) |
+----------+
|    29147 |
+----------+
1 row in set (0.045 sec)

MariaDB [nova]> select count(*) from instances where compute_id is null and node is not null and host is not null;
+----------+
| count(*) |
+----------+
|    22622 |
+----------+
1 row in set (0.048 sec)

Environment
===========
Nova 45a926156c863b468318cce462a21027685d07a6 (upgraded from 2023.1)
Libvirt+KVM
Ceph
Neutron+LXB

Logs & Configs
==============
During nova-manage execution, log messages such as the following are printed:

50 rows matched query populate_instance_compute_id, 6 migrated
50 rows matched query populate_instance_compute_id, 6 migrated
50 rows matched query populate_instance_compute_id, 6 migrated
50 rows matched query populate_instance_compute_id, 6 migrated
...
50 rows matched query populate_instance_compute_id, 1 migrated
50 rows matched query populate_instance_compute_id, 1 migrated
50 rows matched query populate_instance_compute_id, 1 migrated
50 rows matched query populate_instance_compute_id, 1 migrated
50 rows matched query populate_instance_compute_id, 1 migrated
50 rows matched query populate_instance_compute_id, 1 migrated
50 rows matched query populate_instance_compute_id, 1 migrated
50 rows matched query populate_instance_compute_id, 0 migrated
+-------------------------------------+--------------+-----------+
| Migration                           | Total Needed | Completed |
+-------------------------------------+--------------+-----------+
| fill_virtual_interface_list         |            0 |         0 |
| migrate_empty_ratio                 |            0 |         0 |
| migrate_quota_classes_to_api_db     |            0 |         0 |
| migrate_quota_limits_to_api_db      |            0 |         0 |
| migration_migrate_to_uuid           |            0 |         0 |
| populate_dev_uuids                  |            0 |         0 |
| populate_instance_compute_id        |        54300 |      1667 |
| populate_missing_availability_zones |            0 |         0 |
| populate_queued_for_delete          |            0 |         0 |
| populate_user_id                    |            0 |         0 |
| populate_uuids                      |            0 |         0 |
+-------------------------------------+--------------+-----------+

Note the repeating migration counts (6, then 1), which indicate that first 44 and then 49 of the 50 records in each batch are irrelevant for migration (triggering https://opendev.org/openstack/nova/blame/commit/7096423b343ffce9622fd078fc2b3a87fd3386f7/nova/objects/instance.py#L1369). Also note that the "Total Needed" figure is erroneous, as it counts these irrelevant records again each time the method is called.
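
The size of the overcount can be estimated from the summary figures alone. This is back-of-the-envelope arithmetic, assuming every batch matched exactly 50 rows, as the log lines suggest; the variable names are only for illustration.

```python
max_count = 50        # default batch size used by nova-manage
total_needed = 54300  # "Total Needed" reported for populate_instance_compute_id
completed = 1667      # "Completed" reported for the same migration

# Each batch adds its matched rows to the total, so the total divided by
# the batch size gives the number of times the method was invoked.
invocations, remainder = divmod(total_needed, max_count)

# Row evaluations that did not result in a migration.
wasted_evaluations = total_needed - completed

print(invocations, remainder, wasted_evaluations)
```

Under that assumption the method ran 1086 times and evaluated rows 52633 times without migrating them, which is consistent with the long runtime described above.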

A further run of the migration after the above completion shows:

50 rows matched query populate_instance_compute_id, 0 migrated
+-------------------------------------+--------------+-----------+
| Migration                           | Total Needed | Completed |
+-------------------------------------+--------------+-----------+
| fill_virtual_interface_list         |            0 |         0 |
| migrate_empty_ratio                 |            0 |         0 |
| migrate_quota_classes_to_api_db     |            0 |         0 |
| migrate_quota_limits_to_api_db      |            0 |         0 |
| migration_migrate_to_uuid           |            0 |         0 |
| populate_dev_uuids                  |            0 |         0 |
| populate_instance_compute_id        |           50 |         0 |
| populate_missing_availability_zones |            0 |         0 |
| populate_queued_for_delete          |            0 |         0 |
| populate_user_id                    |            0 |         0 |
| populate_uuids                      |            0 |         0 |
+-------------------------------------+--------------+-----------+

If you then increase the --max-count parameter, further migrations will proceed:

100 rows matched query populate_instance_compute_id, 44 migrated
+-------------------------------------+--------------+-----------+
| Migration                           | Total Needed | Completed |
+-------------------------------------+--------------+-----------+
| fill_virtual_interface_list         |            0 |         0 |
| migrate_empty_ratio                 |            0 |         0 |
| migrate_quota_classes_to_api_db     |            0 |         0 |
| migrate_quota_limits_to_api_db      |            0 |         0 |
| migration_migrate_to_uuid           |            0 |         0 |
| populate_dev_uuids                  |            0 |         0 |
| populate_instance_compute_id        |          100 |        44 |
| populate_missing_availability_zones |            0 |         0 |
| populate_queued_for_delete          |            0 |         0 |
| populate_user_id                    |            0 |         0 |
| populate_uuids                      |            0 |         0 |
+-------------------------------------+--------------+-----------+

In our deployment databases we have the following records which would likely trigger this issue:

MariaDB [nova]> select count(*) from instances where node is null;
+----------+
| count(*) |
+----------+
|      251 |
+----------+
1 row in set (0.029 sec)

MariaDB [nova]> select count(*) from instances where node is null;
+----------+
| count(*) |
+----------+
|      141 |
+----------+
1 row in set (0.039 sec)

description: updated
Changed in nova:
status: New → Triaged
importance: Undecided → Medium