Description
===========
When executing online_data_migrations during an upgrade from 2023.1 to 2023.2 or beyond, the 'populate_instance_compute_id' method misses numerous relevant migrations when run against a moderately sized deployment.
The issue is that nova-manage applies a default limit (max-size) of 50 records per batch to the migration. Because numerous records are unsuitable for migration (they have no existing node ID), eventually all of the first 50 records returned by the database query are irrelevant, and the migration exits as if it had completed, even though many relevant records remain to be migrated.
A secondary issue is that this query approach causes the migration method to be executed hundreds or thousands of times more often than necessary, since every iteration has to re-examine and discard the same irrelevant records. This takes a long time: in the deployment I have just upgraded, the migration ran for upwards of 15 minutes before exiting.
My suspicion is that the query in https://opendev.org/openstack/nova/blame/commit/7096423b343ffce9622fd078fc2b3a87fd3386f7/nova/objects/instance.py#L1359 should really be filtering out records without a 'node' (or 'host') entry to avoid the need for internal exception handling, but there may be some other reason this wasn't done initially.
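The stall can be demonstrated in miniature with plain SQL. The sketch below uses a hypothetical, much-simplified instances table (not nova's actual schema or query code): once enough node-less records accumulate at the head of the result set, a batch limit of 50 returns only unmigratable rows, while the filter suggested above returns actionable rows at the same limit:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE instances "
             "(id INTEGER PRIMARY KEY, host TEXT, node TEXT, compute_id INTEGER)")
# 60 old records with no node (never migratable by this method)...
conn.executemany("INSERT INTO instances (host, node) VALUES (?, ?)",
                 [(None, None)] * 60)
# ...followed by 100 records that genuinely need a compute_id.
conn.executemany("INSERT INTO instances (host, node) VALUES (?, ?)",
                 [("host1", "node1")] * 100)

# Migration-style query: the first 50 rows with no compute_id.
batch = conn.execute(
    "SELECT id, node FROM instances WHERE compute_id IS NULL "
    "ORDER BY id LIMIT 50").fetchall()
migratable = [row for row in batch if row[1] is not None]
print(len(batch), len(migratable))   # 50 rows matched, 0 migratable -> loop exits

# With the suggested filter, the same limit returns only actionable rows.
filtered = conn.execute(
    "SELECT id FROM instances WHERE compute_id IS NULL "
    "AND node IS NOT NULL AND host IS NOT NULL ORDER BY id LIMIT 50").fetchall()
print(len(filtered))                 # 50 migratable rows
```

Because the unfiltered query pages from the start of the table on every pass, the node-less rows permanently occupy the batch and the loop can never reach the migratable records behind them.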
Steps to reproduce
==================
Using a moderately sized database (a few tens of thousands of records):
* Perform an upgrade from 2023.1 to 2023.2
* Execute 'nova-manage db online_data_migrations'
Expected result
===============
Migrations complete in a reasonable time, with all relevant records migrated.
Actual result
=============
nova-manage exits after a long period of time with apparent success, but in reality many records remain unmigrated.
In the two deployments we have migrated to date, we are left with the following apparently relevant records, which should have been migrated but have not been:
MariaDB [nova]> select count(*) from instances where compute_id is null and node is not null and host is not null;
+----------+
| count(*) |
+----------+
| 29147 |
+----------+
1 row in set (0.045 sec)
MariaDB [nova]> select count(*) from instances where compute_id is null and node is not null and host is not null;
+----------+
| count(*) |
+----------+
| 22622 |
+----------+
1 row in set (0.048 sec)
Environment
===========
Nova 45a926156c863b468318cce462a21027685d07a6 (upgraded from 2023.1)
Libvirt+KVM
Ceph
Neutron+LXB
Logs & Configs
==============
During nova-manage execution, log messages such as the following are printed:
50 rows matched query populate_instance_compute_id, 6 migrated
50 rows matched query populate_instance_compute_id, 6 migrated
50 rows matched query populate_instance_compute_id, 6 migrated
50 rows matched query populate_instance_compute_id, 6 migrated
...
50 rows matched query populate_instance_compute_id, 1 migrated
50 rows matched query populate_instance_compute_id, 1 migrated
50 rows matched query populate_instance_compute_id, 1 migrated
50 rows matched query populate_instance_compute_id, 1 migrated
50 rows matched query populate_instance_compute_id, 1 migrated
50 rows matched query populate_instance_compute_id, 1 migrated
50 rows matched query populate_instance_compute_id, 1 migrated
50 rows matched query populate_instance_compute_id, 0 migrated
+-------------------------------------+--------------+-----------+
| Migration                           | Total Needed | Completed |
+-------------------------------------+--------------+-----------+
| fill_virtual_interface_list         | 0            | 0         |
| migrate_empty_ratio                 | 0            | 0         |
| migrate_quota_classes_to_api_db     | 0            | 0         |
| migrate_quota_limits_to_api_db      | 0            | 0         |
| migration_migrate_to_uuid           | 0            | 0         |
| populate_dev_uuids                  | 0            | 0         |
| populate_instance_compute_id        | 54300        | 1667      |
| populate_missing_availability_zones | 0            | 0         |
| populate_queued_for_delete          | 0            | 0         |
| populate_user_id                    | 0            | 0         |
| populate_uuids                      | 0            | 0         |
+-------------------------------------+--------------+-----------+
Note the repeating numbers of migrated records, indicating that first 44, then 49, of the 50 records in each batch are irrelevant for migration (triggering https://opendev.org/openstack/nova/blame/commit/7096423b343ffce9622fd078fc2b3a87fd3386f7/nova/objects/instance.py#L1369). Also note that the 'Total Needed' figure is erroneous, as it includes duplicate counts of these irrelevant records on every call of the method.
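A simplified model of the accounting loop (hypothetical numbers and logic, not nova-manage's actual code) shows why 'Total Needed' inflates: every pass re-counts the same irrelevant rows, so the total grows with the number of passes rather than the number of distinct records:

```python
# Hypothetical model of the batching loop: 49 node-less rows sit at the
# head of the result set and are re-matched on every pass.
IRRELEVANT = 49          # records with no node, re-counted each pass
BATCH = 50               # default batch limit
relevant_remaining = 12  # records that can actually be migrated

total_needed = completed = 0
while True:
    # Each pass matches the irrelevant rows plus at most one migratable row.
    migratable = min(BATCH - IRRELEVANT, relevant_remaining)
    found = IRRELEVANT + migratable
    total_needed += found
    completed += migratable
    relevant_remaining -= migratable
    if migratable == 0:  # nothing migrated -> loop declares success
        break

print(total_needed, completed)   # 649 12
```

Twelve distinct migrations take thirteen passes, yet 'Total Needed' reports 649 because the 49 irrelevant rows are counted on every pass; the same effect accounts for the implausible 54300 figure above.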
A further run of the migration after the above completion shows:
50 rows matched query populate_instance_compute_id, 0 migrated
+-------------------------------------+--------------+-----------+
| Migration                           | Total Needed | Completed |
+-------------------------------------+--------------+-----------+
| fill_virtual_interface_list         | 0            | 0         |
| migrate_empty_ratio                 | 0            | 0         |
| migrate_quota_classes_to_api_db     | 0            | 0         |
| migrate_quota_limits_to_api_db      | 0            | 0         |
| migration_migrate_to_uuid           | 0            | 0         |
| populate_dev_uuids                  | 0            | 0         |
| populate_instance_compute_id        | 50           | 0         |
| populate_missing_availability_zones | 0            | 0         |
| populate_queued_for_delete          | 0            | 0         |
| populate_user_id                    | 0            | 0         |
| populate_uuids                      | 0            | 0         |
+-------------------------------------+--------------+-----------+
If you then increase the --max-size parameter, further migrations will proceed:
100 rows matched query populate_instance_compute_id, 44 migrated
+-------------------------------------+--------------+-----------+
| Migration                           | Total Needed | Completed |
+-------------------------------------+--------------+-----------+
| fill_virtual_interface_list         | 0            | 0         |
| migrate_empty_ratio                 | 0            | 0         |
| migrate_quota_classes_to_api_db     | 0            | 0         |
| migrate_quota_limits_to_api_db      | 0            | 0         |
| migration_migrate_to_uuid           | 0            | 0         |
| populate_dev_uuids                  | 0            | 0         |
| populate_instance_compute_id        | 100          | 44        |
| populate_missing_availability_zones | 0            | 0         |
| populate_queued_for_delete          | 0            | 0         |
| populate_user_id                    | 0            | 0         |
| populate_uuids                      | 0            | 0         |
+-------------------------------------+--------------+-----------+
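This is consistent with the paging behaviour described in the description: once the batch limit exceeds the number of irrelevant rows at the head of the result set, each batch contains actionable records again. A minimal sketch using a hypothetical, simplified instances table (not nova's schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE instances "
             "(id INTEGER PRIMARY KEY, host TEXT, node TEXT, compute_id INTEGER)")
# 50 leading records with no node, then records that still need a compute_id.
conn.executemany("INSERT INTO instances (host, node) VALUES (?, ?)",
                 [(None, None)] * 50)
conn.executemany("INSERT INTO instances (host, node) VALUES (?, ?)",
                 [("host1", "node1")] * 200)

def matched_and_migratable(limit):
    # Count how many rows the migration-style query returns at a given
    # limit, and how many of those actually have a node to migrate.
    rows = conn.execute(
        "SELECT node FROM instances WHERE compute_id IS NULL "
        "ORDER BY id LIMIT ?", (limit,)).fetchall()
    return len(rows), sum(1 for (node,) in rows if node is not None)

print(matched_and_migratable(50))    # (50, 0)   -> migration stalls
print(matched_and_migratable(100))   # (100, 50) -> progress resumes
```

Raising the limit is therefore only a workaround: it must stay ahead of the count of node-less records, whereas filtering them out of the query would fix the behaviour at any batch size.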
In our deployment databases we have the following records which would likely trigger this issue:
MariaDB [nova]> select count(*) from instances where node is null;
+----------+
| count(*) |
+----------+
| 251 |
+----------+
1 row in set (0.029 sec)
MariaDB [nova]> select count(*) from instances where node is null;
+----------+
| count(*) |
+----------+
| 141 |
+----------+
1 row in set (0.039 sec)