Description
===========

We run OpenStack Zed based on the Ubuntu cloud archive packages. We regularly live-migrate instances between hypervisors for host maintenance, but suddenly had issues with many instances: Nova correctly refused to migrate an instance from an AMD CPU to an Intel CPU. We have different flavors for the different CPU kinds and scheduler filters that limit these flavors to specific hosts, but in this case the scheduler picked the wrong hosts.

I picked one failing instance and started debugging. The flavor was `a2.medium` (ID: 982), a flavor running on AMD CPUs, but our scheduler filter was called with a request spec containing the `e2.medium` flavor (ID: 831, Intel CPU). It therefore filtered for the wrong hosts, and Nova aborted the live migration because of an invalid target.

The request spec the filter received looked like this:

RequestSpec(
    availability_zone=None,
    flavor=Flavor(831),
    force_hosts=None,
    force_nodes=None,
    id=485269,
    ignore_hosts=[...],
    image=ImageMeta(35c8d5e3-2791-4565-9c19-291869fde98d),
    instance_group=None,
    instance_uuid=982c3ead-59b1-4acd-876b-d55166d8e7f0,
    is_bfv=False,
    limits=SchedulerLimits,
    network_metadata=NetworkMetadata,
    num_instances=1,
    numa_topology=None,
    pci_requests=InstancePCIRequests,
    project_id='16e980bb63b4415694dd2130f5977b8b',
    request_level_params=RequestLevelParams,
    requested_destination=Destination,
    requested_networks=NetworkRequestList,
    requested_resources=[],
    retry=None,
    scheduler_hints={},
    security_groups=SecurityGroupList,
    user_id='d7451e969b3f4229bd1869ed9ad591f3'
)

Despite that, the instance itself is listed with the correct flavor:

$ openstack server show 982c3ead-59b1-4acd-876b-d55166d8e7f0 | grep flavor
| flavor | disk='80', ephemeral='0', original_name='a2.medium', ram='8192', swap='0', vcpus='4' |

Here is an excerpt of our flavors:

MariaDB [novaapi]> select id,flavorid,name,vcpus,memory_mb,root_gb from flavors where name LIKE 'a2.%' OR name LIKE 'e2.%';
+------+----------+------------+-------+-----------+---------+
| id   | flavorid | name       | vcpus | memory_mb | root_gb |
+------+----------+------------+-------+-----------+---------+
|  807 | 022030   | e2.micro   |     2 |      2048 |      25 |
|  813 | 022010   | e2.nano    |     1 |       512 |      10 |
|  819 | 022060   | e2.large   |     4 |     12288 |      40 |
|  822 | 022020   | e2.tiny    |     2 |      1024 |      20 |
|  825 | 022070   | e2.xlarge  |     4 |     16384 |      60 |
|  828 | 022080   | e2.2xlarge |     8 |     32768 |      80 |
|  831 | 022050   | e2.medium  |     4 |      8192 |      25 |
|  834 | 022040   | e2.small   |     2 |      4096 |      25 |
|  980 | 022090   | e2.4xlarge |    16 |     65536 |     160 |
|  982 | 026050   | a2.medium  |     4 |      8192 |      80 |
| 1003 | 026040   | a2.small   |     2 |      4096 |      40 |
| 1005 | 026060   | a2.large   |     6 |     12288 |     120 |
| 1008 | 026070   | a2.xlarge  |     8 |     16384 |     160 |
| 1011 | 026090   | a2.4xlarge |    32 |     65536 |     640 |
| 1014 | 026080   | a2.2xlarge |    16 |     32768 |     320 |
+------+----------+------------+-------+-----------+---------+

The instance also has the correct flavor listed in Horizon, in the CLI and in the MySQL database (instance_type_id), and it is running on the correct compute host (compute-a2b1), both according to the database and on the actual hypervisor.

MariaDB [nova]> select * from instances where uuid = '982c3ead-59b1-4acd-876b-d55166d8e7f0' \G
*************************** 1. row ***************************
...
     launched_on: compute-t2a3
instance_type_id: 982   <==========================================
            uuid: 982c3ead-59b1-4acd-876b-d55166d8e7f0
...
            node: compute-a2b1.cld.domain.tld

Yet, nova-scheduler runs the filter with the wrong request_spec object shown above.
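For illustration, the "print statement" I used to capture what the filter actually receives was essentially a debug log line in our filter code, roughly like the following sketch (the class name is made up and our real isolation logic is omitted; `host_passes(host_state, spec_obj)` is the standard nova filter entry point):

# Minimal sketch of the debug hook used in one of our custom scheduler
# filters; the class name is illustrative only.
from oslo_log import log as logging

from nova.scheduler import filters

LOG = logging.getLogger(__name__)


class DebugFlavorFilter(filters.BaseHostFilter):
    """Log which flavor the scheduler actually received for an instance."""

    def host_passes(self, host_state, spec_obj):
        # spec_obj is the RequestSpec shown above; for the affected instance
        # it already carries flavor id 831 (e2.medium) instead of 982.
        LOG.debug("scheduling %s with flavor %s (internal id %s)",
                  spec_obj.instance_uuid, spec_obj.flavor.name,
                  spec_obj.flavor.id)
        return True  # the real filtering/isolation checks are omitted here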
I followed the `request_spec` via source code and print statements through nova-scheduler and nova-conductor back to nova-api, where it is loaded in https://github.com/openstack/nova/blob/stable/zed/nova/compute/api.py#L5496. At that point the request spec is already loaded with the wrong flavor ID (831 instead of 982). A look at the database confirmed that:

MariaDB [novaapi]> select * from request_specs where instance_uuid = '7f83337d-88a9-4f49-a4b0-cc0495ea698a' \G
*************************** 1. row ***************************
   created_at: 2024-01-10 11:19:30
   updated_at: NULL
           id: 499312
instance_uuid: 7f83337d-88a9-4f49-a4b0-cc0495ea698a
         spec: {
  "nova_object.name": "RequestSpec",
  "nova_object.namespace": "nova",
  "nova_object.version": "1.14",
  "nova_object.data": {
    "image": {
      "nova_object.name": "ImageMeta",
      "nova_object.namespace": "nova",
      "nova_object.version": "1.8",
      "nova_object.data": {
        "id": "4cd37a9e-7bd6-443d-83f5-1b96f7ff005d",
        "name": "ubuntu-22.04",
        "status": "active",
        "checksum": "f4a9b90d378d90fdbf66b2ad3afe4da7",
        "owner": "b18c2da2dbfa45138fb6077eafb2aa51",
        "size": 2361393152,
        "container_format": "bare",
        "disk_format": "raw",
        "created_at": "2023-12-12T02:06:42Z",
        "updated_at": "2023-12-12T02:11:15Z",
        "min_ram": 128,
        "min_disk": 5,
        "properties": {
          "nova_object.name": "ImageMetaProps",
          "nova_object.namespace": "nova",
          "nova_object.version": "1.31",
          "nova_object.data": {
            "hw_architecture": "x86_64",
            "hw_disk_bus": "scsi",
            "hw_firmware_type": "uefi",
            "hw_qemu_guest_agent": true,
            "hw_scsi_model": "virtio-scsi",
            "hw_vm_mode": "hvm",
            "img_hv_type": "kvm",
            "os_admin_user": "ubuntu",
            "os_distro": "ubuntu",
            "os_require_quiesce": true,
            "os_type": "linux"
          },
          "nova_object.changes": ["hw_disk_bus", "hw_architecture", "hw_vm_mode", "os_type", "hw_qemu_guest_agent", "os_admin_user", "os_distro", "hw_firmware_type", "hw_scsi_model", "os_require_quiesce", "img_hv_type"]
        }
      },
      "nova_object.changes": ["updated_at", "min_ram", "size", "id", "properties", "status", "disk_format", "created_at", "name", "owner", "checksum", "min_disk", "container_format"]
    },
    "numa_topology": null,
    "pci_requests": {
      "nova_object.name": "InstancePCIRequests",
      "nova_object.namespace": "nova",
      "nova_object.version": "1.1",
      "nova_object.data": {"requests": []},
      "nova_object.changes": ["requests"]
    },
    "project_id": "61454faef1234faa86673d8b7760938a",
    "user_id": "13ef2628df2b4eba875934d148d2cd26",
    "availability_zone": null,
    "flavor": {
      "nova_object.name": "Flavor",
      "nova_object.namespace": "nova",
      "nova_object.version": "1.2",
      "nova_object.data": {
        "id": 982,
        "name": "a2.medium",
        "memory_mb": 8192,
        "vcpus": 4,
        "root_gb": 80,
        "ephemeral_gb": 0,
        "flavorid": "026050",
        "swap": 0,
        "rxtx_factor": 1.0,
        "vcpu_weight": 0,
        "disabled": false,
        "is_public": true,
        "extra_specs": {"hw:cpu_max_sockets": "1", "hw:cpu_policy": "shared", "os:secure_boot": "disabled", "quota:cpu_shares": "400"},
        "description": null,
        "created_at": "2023-03-29T15:06:03Z",
        "updated_at": null,
        "deleted_at": null,
        "deleted": false
      },
      "nova_object.changes": ["extra_specs"]
    },
    "num_instances": 1,
    "ignore_hosts": null,
    "force_hosts": null,
    "force_nodes": null,
    "requested_destination": null,
    "retry": null,
    "limits": {
      "nova_object.name": "SchedulerLimits",
      "nova_object.namespace": "nova",
      "nova_object.version": "1.0",
      "nova_object.data": {"numa_topology": null, "vcpu": null, "disk_gb": null, "memory_mb": null},
      "nova_object.changes": ["vcpu", "numa_topology", "disk_gb", "memory_mb"]
    },
    "instance_group": null,
    "scheduler_hints": {},
    "instance_uuid": "7f83337d-88a9-4f49-a4b0-cc0495ea698a",
    "security_groups": {
      "nova_object.name": "SecurityGroupList",
      "nova_object.namespace": "nova",
      "nova_object.version": "1.1",
      "nova_object.data": {
        "objects": [
          {
            "nova_object.name": "SecurityGroup",
            "nova_object.namespace": "nova",
            "nova_object.version": "1.2",
            "nova_object.data": {"uuid": "5d94b29d-fdb1-4633-92df-8fa47e79864b"},
            "nova_object.changes": ["uuid"]
          }
        ]
      },
      "nova_object.changes": ["objects"]
    },
    "is_bfv": false,
    "requested_resources": []
  },
  "nova_object.changes": ["image", "is_bfv", "requested_destination", "security_groups", "force_nodes", "num_instances", "retry", "numa_topology", "instance_group", "limits", "instance_uuid", "availability_zone", "user_id", "requested_resources", "force_hosts", "ignore_hosts", "project_id", "pci_requests", "scheduler_hints", "flavor"]
}
1 row in set (0.000 sec)

I further found the `instance_extra` data to be out of sync, with a leftover "new" flavor present:

MariaDB [nova]> select * from instance_extra where instance_uuid = '982c3ead-59b1-4acd-876b-d55166d8e7f0' \G
*************************** 1. row ***************************
     created_at: 2023-11-06 14:44:23
     updated_at: 2024-01-10 11:26:50
     deleted_at: NULL
        deleted: 0
             id: 442795
  instance_uuid: 982c3ead-59b1-4acd-876b-d55166d8e7f0
  numa_topology: NULL
   pci_requests: []
         flavor: {
  "cur": {
    "nova_object.name": "Flavor",
    "nova_object.namespace": "nova",
    "nova_object.version": "1.2",
    "nova_object.data": {
      "id": 982,
      "name": "a2.medium",
      "memory_mb": 8192,
      "vcpus": 4,
      "root_gb": 80,
      "ephemeral_gb": 0,
      "flavorid": "026050",
      "swap": 0,
      "rxtx_factor": 1.0,
      "vcpu_weight": 0,
      "disabled": false,
      "is_public": true,
      "extra_specs": {"hw:cpu_max_sockets": "1", "hw:cpu_policy": "shared", "os:secure_boot": "disabled", "quota:cpu_shares": "400"},
      "description": null,
      "created_at": "2023-03-29T15:06:03Z",
      "updated_at": null,
      "deleted_at": null,
      "deleted": false
    },
    "nova_object.changes": ["extra_specs"]
  },
  "old": null,
  "new": {
    "nova_object.name": "Flavor",
    "nova_object.namespace": "nova",
    "nova_object.version": "1.2",
    "nova_object.data": {
      "id": 831,
      "name": "e2.medium",
      "memory_mb": 8192,
      "vcpus": 4,
      "root_gb": 25,
      "ephemeral_gb": 0,
      "flavorid": "022050",
      "swap": 0,
      "rxtx_factor": 1.0,
      "vcpu_weight": 0,
      "disabled": false,
      "is_public": true,
      "extra_specs": {"hw:cpu_max_sockets": "1", "hw:cpu_policy": "shared", "quota:cpu_shares": "400"},
      "description": null,
      "created_at": "2022-08-03T11:11:49Z",
      "updated_at": null,
      "deleted_at": null,
      "deleted": false
    },
    "nova_object.changes": ["extra_specs"]
  }
}
device_metadata: NULL
  trusted_certs: NULL
         vpmems: NULL
      resources: NULL
1 row in set (0.000 sec)

It appears that a user tried to resize the instance earlier, the resize failed (I do not know why yet), and neither the `instance_extra` nor the `request_spec` data was reverted correctly:
$ openstack server migration list --server 982c3ead-59b1-4acd-876b-d55166d8e7f0
| Id | UUID | Source Node | Dest Node | Source Compute | Dest Compute | Dest Host | Status | Server UUID | Old Flavor | New Flavor | Type | Created At | Updated At |
| 138991 | 7d0f464b-2fea-49b0-87e1-596624ca03cc | None | None | compute-a2b1 | None | None | error | 982c3ead-59b1-4acd-876b-d55166d8e7f0 | 982 | 982 | live-migration | 2024-01-09T09:56:45.000000 | 2024-01-09T09:56:50.000000 |
| 138931 | e1eeef36-f4e9-4f2a-adc6-985413cea4ed | compute-a2b1.cld.domain.tld | compute-t2b2.cld.domain.tld | compute-a2b1 | compute-t2b2 | XXXXXXXXXXXX | error | 982c3ead-59b1-4acd-876b-d55166d8e7f0 | 982 | 831 | resize | 2024-01-08T14:11:13.000000 | 2024-01-08T14:11:14.000000 |
| 138062 | c12a2ae0-969f-454f-b6d0-fa79381ba29f | compute-t2a3.cld.domain.tld | compute-a2b1.cld.domain.tld | compute-t2a3 | compute-a2b1 | XXXXXXXXXXXX | confirmed | 982c3ead-59b1-4acd-876b-d55166d8e7f0 | 831 | 982 | resize | 2023-12-04T21:17:44.000000 | 2023-12-04T21:18:04.000000 |
| 137966 | 12e3cffe-caeb-44ce-ac5a-baa0aa17d6e1 | compute-t2c1.cld.domain.tld | compute-t2a3.cld.domain.tld | compute-t2c1 | compute-t2a3 | None | completed | 982c3ead-59b1-4acd-876b-d55166d8e7f0 | 831 | 831 | live-migration | 2023-12-02T16:14:13.000000 | 2023-12-02T16:35:19.000000 |
| 137732 | 2ce96b71-143e-46cf-a7b2-822b2308f60a | compute-t2a3.cld.domain.tld | compute-t2c1.cld.domain.tld | compute-t2a3 | compute-t2c1 | None | completed | 982c3ead-59b1-4acd-876b-d55166d8e7f0 | 831 | 831 | live-migration | 2023-12-01T09:23:32.000000 | 2023-12-01T09:24:13.000000 |
| 137286 | 1c6c0e7d-cf33-4522-9710-e372392f3dad | compute-t2c3.cld.domain.tld | compute-t2a3.cld.domain.tld | compute-t2c3 | compute-t2a3 | None | completed | 982c3ead-59b1-4acd-876b-d55166d8e7f0 | 831 | 831 | live-migration | 2023-11-29T23:42:59.000000 | 2023-11-30T00:34:06.000000 |
| 137013 | 4afff1ec-08fd-4995-8642-b8341a169efb | compute-t2a3.cld.domain.tld | compute-t2c3.cld.domain.tld | compute-t2a3 | compute-t2c3 | None | completed | 982c3ead-59b1-4acd-876b-d55166d8e7f0 | 831 | 831 | live-migration | 2023-11-28T09:54:53.000000 | 2023-11-28T09:55:35.000000 |
| 135478 | cf28ff61-87d7-49d8-97b5-10382163d158 | compute-t2c3.cld.domain.tld | compute-t2a3.cld.domain.tld | compute-t2c3 | compute-t2a3 | None | completed | 982c3ead-59b1-4acd-876b-d55166d8e7f0 | 831 | 831 | live-migration | 2023-11-09T06:48:44.000000 | 2023-11-09T06:52:46.000000 |
| 135244 | 30f87d2e-ffc5-43ea-ae16-ade2c9a553b3 | compute-t2a3.cld.domain.tld | compute-t2c3.cld.domain.tld | compute-t2a3 | compute-t2c3 | None | completed | 982c3ead-59b1-4acd-876b-d55166d8e7f0 | 831 | 831 | live-migration | 2023-11-08T20:49:01.000000 | 2023-11-08T20:49:47.000000 |

Yet even the live migration attempted later (138991) lists the correct flavor ID 982 as both old and new flavor.
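As far as I can tell, fixing a single instance by hand means rewriting the serialized objects from the instance's authoritative flavor via nova's Python objects, roughly along the lines of the sketch below (my own untested outline, not an official tool; it assumes a single-cell deployment and a controller where nova.conf with both the cell and API database connections is readable):

# Rough sketch of a manual per-instance repair; not an official nova tool.
# Assumes readable nova.conf and a single-cell deployment.
import sys

from nova import config, context, objects

config.parse_args(sys.argv)
objects.register_all()
ctxt = context.get_admin_context()

uuid = '982c3ead-59b1-4acd-876b-d55166d8e7f0'
instance = objects.Instance.get_by_uuid(ctxt, uuid, expected_attrs=['flavor'])

# Drop the leftover "new"/"old" flavors from the failed resize in instance_extra.
if instance.new_flavor is not None or instance.old_flavor is not None:
    instance.new_flavor = None
    instance.old_flavor = None
    instance.save()

# Rewrite the stored request spec with the instance's current flavor.
spec = objects.RequestSpec.get_by_instance_uuid(ctxt, uuid)
spec.flavor = instance.flavor
spec.save()

Even if that works, it is exactly the kind of manual surgery I would rather not have to do per instance.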
My problem is not so much the bug itself, i.e. that data becomes inconsistent, especially on failures. We know that this happens with every OpenStack version and have had to fix the database many times before. Our problem here is the complexity of fixing the inconsistencies, because most of the data consists of serialized Python objects.

Are there any tools or commands, automatic or manual, to check and fix these request spec inconsistencies? Maybe something similar to `nova-manage placement heal_allocations`?

At the moment I cannot even tell how many instances are affected. We also use scheduler filters to isolate users and projects onto separate hosts, even where the hosts would technically be compatible. With inconsistent data like this, a live migration would therefore not necessarily fail; it would silently violate our security and isolation boundaries instead. So I need to check each instance manually.

Steps to reproduce
==================

I cannot yet provide commands that produce this data inconsistency, but once such an inconsistency is introduced into a system manually, live migrations can fail because the scheduler makes wrong decisions.

Expected result
===============

All places where nova stores the flavor details should be kept in sync, or should at least be fixable/resyncable after failures.

Actual result
=============

Scheduler filters are run with wrong data, violating scheduling constraints such as compatibility, security and isolation boundaries. In the compatibility case, other operations such as live migrations fail. In the other cases, no apparent error may occur at all.

Environment
===========

1. Exact version of OpenStack you are running. See the following list for all releases: http://docs.openstack.org/releases/

   ii  nova-common       3:25.2.1-0ubuntu1  all  OpenStack Compute - common files
   ii  nova-conductor    3:25.2.1-0ubuntu1  all  OpenStack Compute - conductor service
   ii  nova-scheduler    3:25.2.1-0ubuntu1  all  OpenStack Compute - virtual machine scheduler
   ii  nova-spiceproxy   3:25.2.1-0ubuntu1  all  OpenStack Compute - spice html5 proxy
   ii  python3-nova      3:25.2.1-0ubuntu1  all  OpenStack Compute Python 3 libraries

2. Which hypervisor did you use?
   Libvirt + KVM

3. Which storage type did you use?
   Ceph 17.2.7-1focal, local qcow2 disks

4. Which networking type did you use?
   Neutron ML2/LXB

Logs & Configs
==============

The tool *sosreport* has support for some OpenStack projects. It's worth having a look at it. For example, if you want to collect the logs of a compute node you would execute:

$ sudo sosreport -o openstack_nova --batch

on that compute node. Attach the logs to this bug report. Please consider that these logs need to be collected in "DEBUG" mode.
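Addendum: to get at least a rough idea of how many instances are affected, the best approach I have found so far is to compare the flavor embedded in each stored request spec with the instance's current flavor, along the lines of this read-only sketch (same assumptions as the repair sketch above):

# Sketch: list instances whose stored RequestSpec carries a different flavor
# than the instance itself. Read-only; not an official nova tool.
import sys

from nova import config, context, exception, objects

config.parse_args(sys.argv)
objects.register_all()
ctxt = context.get_admin_context()

instances = objects.InstanceList.get_by_filters(
    ctxt, {'deleted': False}, expected_attrs=['flavor'])

for inst in instances:
    try:
        spec = objects.RequestSpec.get_by_instance_uuid(ctxt, inst.uuid)
    except exception.RequestSpecNotFound:
        continue  # very old instances may not have a stored request spec
    if spec.flavor.flavorid != inst.flavor.flavorid:
        print('%s: instance has %s, request spec has %s'
              % (inst.uuid, inst.flavor.name, spec.flavor.name))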