request_spec out of sync with instance details

Bug #2049030 reported by Jan Graichen
This bug affects 1 person
| Affects | Status | Importance | Assigned to | Milestone |
| OpenStack Compute (nova) | New | Undecided | Unassigned | |

Bug Description

Description
===========

We run OpenStack Zed based on the Ubuntu Cloud Archive packages. We regularly live-migrate instances between hypervisors for host maintenance, but suddenly ran into issues with many instances. Nova correctly refused to migrate an instance from an AMD CPU to an Intel CPU. We have different flavors for different CPU types and scheduler filters that limit these flavors to specific hosts, but in this case the scheduler selected hosts of the wrong kind.

I picked one failing instance and started debugging. The flavor was `a2.medium` (ID: 982), a flavor running on AMD CPUs, but our scheduler filter was called with a request spec containing an `e2.medium` (ID: 831) flavor (Intel CPU). It therefore filtered for the wrong hosts, and Nova aborted the live migration because of an invalid target.

The request spec the filter received looked like this:

    RequestSpec(
        availability_zone=None,
        flavor=Flavor(831),
        force_hosts=None,
        force_nodes=None,
        id=485269,
        ignore_hosts=[...],
        image=ImageMeta(35c8d5e3-2791-4565-9c19-291869fde98d),
        instance_group=None,
        instance_uuid=982c3ead-59b1-4acd-876b-d55166d8e7f0,
        is_bfv=False,
        limits=SchedulerLimits,
        network_metadata=NetworkMetadata,
        num_instances=1,
        numa_topology=None,
        pci_requests=InstancePCIRequests,
        project_id='16e980bb63b4415694dd2130f5977b8b',
        request_level_params=RequestLevelParams,
        requested_destination=Destination,
        requested_networks=NetworkRequestList,
        requested_resources=[],
        retry=None,
        scheduler_hints={},
        security_groups=SecurityGroupList,
        user_id='d7451e969b3f4229bd1869ed9ad591f3'
    )

Despite that, the instance itself is listed with the correct flavor:

    $ openstack server show 982c3ead-59b1-4acd-876b-d55166d8e7f0 | grep flavor
    | flavor | disk='80', ephemeral='0', original_name='a2.medium', ram='8192', swap='0', vcpus='4'

Here is an excerpt of our flavors:

    MariaDB [novaapi]> select id,flavorid,name,vcpus,memory_mb,root_gb from flavors where name LIKE 'a2.%' OR name LIKE 'e2.%';
    +------+----------+------------+-------+-----------+---------+
    | id   | flavorid | name       | vcpus | memory_mb | root_gb |
    +------+----------+------------+-------+-----------+---------+
    |  807 | 022030   | e2.micro   |     2 |      2048 |      25 |
    |  813 | 022010   | e2.nano    |     1 |       512 |      10 |
    |  819 | 022060   | e2.large   |     4 |     12288 |      40 |
    |  822 | 022020   | e2.tiny    |     2 |      1024 |      20 |
    |  825 | 022070   | e2.xlarge  |     4 |     16384 |      60 |
    |  828 | 022080   | e2.2xlarge |     8 |     32768 |      80 |
    |  831 | 022050   | e2.medium  |     4 |      8192 |      25 |
    |  834 | 022040   | e2.small   |     2 |      4096 |      25 |
    |  980 | 022090   | e2.4xlarge |    16 |     65536 |     160 |
    |  982 | 026050   | a2.medium  |     4 |      8192 |      80 |
    | 1003 | 026040   | a2.small   |     2 |      4096 |      40 |
    | 1005 | 026060   | a2.large   |     6 |     12288 |     120 |
    | 1008 | 026070   | a2.xlarge  |     8 |     16384 |     160 |
    | 1011 | 026090   | a2.4xlarge |    32 |     65536 |     640 |
    | 1014 | 026080   | a2.2xlarge |    16 |     32768 |     320 |
    +------+----------+------------+-------+-----------+---------+

The instance also has the correct flavor listed in Horizon, in the CLI, and in the MySQL database (`instance_type_id`). It is running, both according to the database and in reality, on the correct compute host (compute-a2b1):

    MariaDB [nova]> select * from instances where uuid = '982c3ead-59b1-4acd-876b-d55166d8e7f0' \G
    *************************** 1. row ***************************
        ...
             launched_on: compute-t2a3
        instance_type_id: 982 <==========================================
                    uuid: 982c3ead-59b1-4acd-876b-d55166d8e7f0
        ...
                    node: compute-a2b1.cld.domain.tld

Yet nova-scheduler runs the filter with the wrong `request_spec` object shown above. Using the source code and print statements, I followed the `request_spec` from nova-scheduler through nova-conductor back to nova-api, where it is loaded at https://github.com/openstack/nova/blob/stable/zed/nova/compute/api.py#L5496. At that point, the request spec is already loaded with the wrong flavor ID (831 instead of 982). A look at the database confirmed that:

    MariaDB [novaapi]> select * from request_specs where instance_uuid = '982c3ead-59b1-4acd-876b-d55166d8e7f0' \G
    *************************** 1. row ***************************
        created_at: 2023-11-06 14:44:22
        updated_at: 2024-01-08 14:11:14
                id: 485269
     instance_uuid: 982c3ead-59b1-4acd-876b-d55166d8e7f0
              spec: {
                      "nova_object.name": "RequestSpec",
                      "nova_object.namespace": "nova",
                      "nova_object.version": "1.14",
                      "nova_object.data": {
                        "id": 485269,
                        "image": {
                          "nova_object.name": "ImageMeta",
                          "nova_object.namespace": "nova",
                          "nova_object.version": "1.8",
                          "nova_object.data": {
                            "id": "35c8d5e3-2791-4565-9c19-291869fde98d",
                            "name": "ubuntu-22.04",
                            "status": "active",
                            "checksum": "83cae4f849822d728ef36885b0845340",
                            "owner": "b18c2da2dbfa45138fb6077eafb2aa51",
                            "size": 2361393152,
                            "container_format": "bare",
                            "disk_format": "raw",
                            "created_at": "2023-10-27T02:03:21Z",
                            "updated_at": "2023-10-27T02:05:28Z",
                            "min_ram": 128,
                            "min_disk": 5,
                            "properties": {
                              "nova_object.name": "ImageMetaProps",
                              "nova_object.namespace": "nova",
                              "nova_object.version": "1.31",
                              "nova_object.data": {
                                "hw_architecture": "x86_64",
                                "hw_disk_bus": "scsi",
                                "hw_qemu_guest_agent": true,
                                "hw_scsi_model": "virtio-scsi",
                                "hw_vm_mode": "hvm",
                                "img_hv_type": "kvm",
                                "os_admin_user": "ubuntu",
                                "os_distro": "ubuntu",
                                "os_require_quiesce": true,
                                "os_type": "linux"
                              },
                              "nova_object.changes": [
                                "hw_vm_mode",
                                "img_hv_type",
                                "hw_qemu_guest_agent",
                                "os_type",
                                "hw_architecture",
                                "os_distro",
                                "os_admin_user",
                                "os_require_quiesce",
                                "hw_scsi_model",
                                "hw_disk_bus"
                              ]
                            }
                          },
                          "nova_object.changes": [
                            "checksum",
                            "status",
                            "container_format",
                            "min_disk",
                            "size",
                            "min_ram",
                            "owner",
                            "id",
                            "name",
                            "updated_at",
                            "properties",
                            "created_at",
                            "disk_format"
                          ]
                        },
                        "numa_topology": null,
                        "pci_requests": {
                          "nova_object.name": "InstancePCIRequests",
                          "nova_object.namespace": "nova",
                          "nova_object.version": "1.1",
                          "nova_object.data": {
                            "instance_uuid": "982c3ead-59b1-4acd-876b-d55166d8e7f0",
                            "requests": []
                          },
                          "nova_object.changes": ["requests", "instance_uuid"]
                        },
                        "project_id": "16e980bb63b4415694dd2130f5977b8b",
                        "user_id": "d7451e969b3f4229bd1869ed9ad591f3",
                        "availability_zone": null,
                        "flavor": {
                          "nova_object.name": "Flavor",
                          "nova_object.namespace": "nova",
                          "nova_object.version": "1.2",
                          "nova_object.data": {
                            "id": 831,
                            "name": "e2.medium",
                            "memory_mb": 8192,
                            "vcpus": 4,
                            "root_gb": 25,
                            "ephemeral_gb": 0,
                            "flavorid": "022050",
                            "swap": 0,
                            "rxtx_factor": 1.0,
                            "vcpu_weight": 0,
                            "disabled": false,
                            "is_public": true,
                            "extra_specs": {
                              "hw:cpu_max_sockets": "1",
                              "hw:cpu_policy": "shared",
                              "quota:cpu_shares": "400"
                            },
                            "description": null,
                            "created_at": "2022-08-03T11:11:49Z",
                            "updated_at": null,
                            "deleted_at": null,
                            "deleted": false
                          },
                          "nova_object.changes": ["extra_specs"]
                        },
                        "num_instances": 1,
                        "ignore_hosts": [],
                        "force_hosts": null,
                        "force_nodes": null,
                        "requested_destination": null,
                        "retry": null,
                        "limits": {
                          "nova_object.name": "SchedulerLimits",
                          "nova_object.namespace": "nova",
                          "nova_object.version": "1.0",
                          "nova_object.data": {
                            "numa_topology": null,
                            "vcpu": null,
                            "disk_gb": null,
                            "memory_mb": null
                          },
                          "nova_object.changes": ["vcpu", "memory_mb", "disk_gb", "numa_topology"]
                        },
                        "instance_group": null,
                        "scheduler_hints": {},
                        "instance_uuid": "982c3ead-59b1-4acd-876b-d55166d8e7f0",
                        "security_groups": {
                          "nova_object.name": "SecurityGroupList",
                          "nova_object.namespace": "nova",
                          "nova_object.version": "1.1",
                          "nova_object.data": {
                            "objects": [
                              {
                                "nova_object.name": "SecurityGroup",
                                "nova_object.namespace": "nova",
                                "nova_object.version": "1.2",
                                "nova_object.data": {
                                  "uuid": "a0446aa5-0e22-4f4a-b1f5-0abc68fbed6f"
                                },
                                "nova_object.changes": ["uuid"]
                              },
                              {
                                "nova_object.name": "SecurityGroup",
                                "nova_object.namespace": "nova",
                                "nova_object.version": "1.2",
                                "nova_object.data": {
                                  "uuid": "520911dd-9f1d-442b-936b-d35d4c87190c"
                                },
                                "nova_object.changes": ["uuid"]
                              },
                              {
                                "nova_object.name": "SecurityGroup",
                                "nova_object.namespace": "nova",
                                "nova_object.version": "1.2",
                                "nova_object.data": { "name": "default" },
                                "nova_object.changes": ["name"]
                              }
                            ]
                          },
                          "nova_object.changes": ["objects"]
                        },
                        "is_bfv": false,
                        "requested_resources": []
                      },
                      "nova_object.changes": [
                        "limits",
                        "requested_destination",
                        "security_groups",
                        "flavor",
                        "numa_topology",
                        "ignore_hosts",
                        "requested_resources",
                        "pci_requests",
                        "image"
                      ]
                    }
    1 row in set (0.000 sec)
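
Because the `spec` column is plain JSON, a mismatch like this one can be detected mechanically. The following is only a minimal sketch, not an existing Nova tool: `spec_flavor_id` and `find_mismatches` are hypothetical helper names, and the rows are assumed to have been exported beforehand from `request_specs` (API database) and `instances` (cell database). It also assumes that `instances.instance_type_id` and the flavor ID embedded in the spec use the same numbering, as they do in this report. An instance with a resize still in progress may legitimately differ, so hits need a manual look.

```python
import json


def spec_flavor_id(spec_json):
    """Return the flavor ID embedded in a serialized RequestSpec, or None."""
    spec = json.loads(spec_json)
    flavor = spec.get("nova_object.data", {}).get("flavor") or {}
    return flavor.get("nova_object.data", {}).get("id")


def find_mismatches(request_specs, instance_type_ids):
    """Yield (uuid, spec_flavor_id, instance_type_id) for out-of-sync rows.

    request_specs: iterable of (instance_uuid, spec_json) pairs.
    instance_type_ids: dict mapping instance_uuid -> instances.instance_type_id.
    """
    for instance_uuid, spec_json in request_specs:
        expected = instance_type_ids.get(instance_uuid)
        if expected is None:
            continue  # no instance row exported for this UUID
        actual = spec_flavor_id(spec_json)
        if actual != expected:
            yield instance_uuid, actual, expected


# Trimmed-down version of the request spec row shown above.
sample_spec = json.dumps({
    "nova_object.name": "RequestSpec",
    "nova_object.data": {
        "flavor": {
            "nova_object.name": "Flavor",
            "nova_object.data": {"id": 831, "name": "e2.medium"},
        },
    },
})
sample_uuid = "982c3ead-59b1-4acd-876b-d55166d8e7f0"
mismatches = list(find_mismatches([(sample_uuid, sample_spec)], {sample_uuid: 982}))
print(mismatches)
```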

I further identified the `instance_extra` data to be out of sync, with a "new" flavor present:

    MariaDB [nova]> select * from instance_extra where instance_uuid = '982c3ead-59b1-4acd-876b-d55166d8e7f0' \G
    *************************** 1. row ***************************
        created_at: 2023-11-06 14:44:23
        updated_at: 2024-01-10 11:26:50
        deleted_at: NULL
           deleted: 0
                id: 442795
     instance_uuid: 982c3ead-59b1-4acd-876b-d55166d8e7f0
     numa_topology: NULL
      pci_requests: []
            flavor: {
                      "cur": {
                        "nova_object.name": "Flavor",
                        "nova_object.namespace": "nova",
                        "nova_object.version": "1.2",
                        "nova_object.data": {
                          "id": 982,
                          "name": "a2.medium",
                          "memory_mb": 8192,
                          "vcpus": 4,
                          "root_gb": 80,
                          "ephemeral_gb": 0,
                          "flavorid": "026050",
                          "swap": 0,
                          "rxtx_factor": 1.0,
                          "vcpu_weight": 0,
                          "disabled": false,
                          "is_public": true,
                          "extra_specs": {
                            "hw:cpu_max_sockets": "1",
                            "hw:cpu_policy": "shared",
                            "os:secure_boot": "disabled",
                            "quota:cpu_shares": "400"
                          },
                          "description": null,
                          "created_at": "2023-03-29T15:06:03Z",
                          "updated_at": null,
                          "deleted_at": null,
                          "deleted": false
                        },
                        "nova_object.changes": ["extra_specs"]
                      },
                      "old": null,
                      "new": {
                        "nova_object.name": "Flavor",
                        "nova_object.namespace": "nova",
                        "nova_object.version": "1.2",
                        "nova_object.data": {
                          "id": 831,
                          "name": "e2.medium",
                          "memory_mb": 8192,
                          "vcpus": 4,
                          "root_gb": 25,
                          "ephemeral_gb": 0,
                          "flavorid": "022050",
                          "swap": 0,
                          "rxtx_factor": 1.0,
                          "vcpu_weight": 0,
                          "disabled": false,
                          "is_public": true,
                          "extra_specs": {
                            "hw:cpu_max_sockets": "1",
                            "hw:cpu_policy": "shared",
                            "quota:cpu_shares": "400"
                          },
                          "description": null,
                          "created_at": "2022-08-03T11:11:49Z",
                          "updated_at": null,
                          "deleted_at": null,
                          "deleted": false
                        },
                        "nova_object.changes": ["extra_specs"]
                      }
                    }
    device_metadata: NULL
      trusted_certs: NULL
             vpmems: NULL
          resources: NULL
    1 row in set (0.000 sec)
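
A leftover `"new"` flavor like the one above can be spotted in the same way. This is a sketch under assumptions: `stale_new_flavor` is a hypothetical helper, and `resize_in_progress` is a flag the caller would have to derive separately (for example from the instance's task state), since a `"new"` flavor is expected while a resize is actually running.

```python
import json


def stale_new_flavor(flavor_json, resize_in_progress=False):
    """Return (cur_id, new_id) if a leftover "new" flavor is present, else None."""
    flavor = json.loads(flavor_json)

    def flavor_id(key):
        obj = flavor.get(key)
        return obj["nova_object.data"]["id"] if obj else None

    new_id = flavor_id("new")
    if new_id is None or resize_in_progress:
        return None
    return flavor_id("cur"), new_id


# Trimmed-down version of the instance_extra.flavor column shown above.
sample_extra_flavor = json.dumps({
    "cur": {"nova_object.name": "Flavor",
            "nova_object.data": {"id": 982, "name": "a2.medium"}},
    "old": None,
    "new": {"nova_object.name": "Flavor",
            "nova_object.data": {"id": 831, "name": "e2.medium"}},
})
print(stale_new_flavor(sample_extra_flavor))
```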

It appears that a user previously tried to resize the instance, which failed (no idea why yet), and neither `instance_extra` nor the `request_spec` data was reverted correctly:

    $ openstack server migration list --server 982c3ead-59b1-4acd-876b-d55166d8e7f0
    +--------+--------------------------------------+-----------------------------+-----------------------------+----------------+--------------+--------------+-----------+--------------------------------------+------------+------------+----------------+----------------------------+----------------------------+
    | Id     | UUID                                 | Source Node                 | Dest Node                   | Source Compute | Dest Compute | Dest Host    | Status    | Server UUID                          | Old Flavor | New Flavor | Type           | Created At                 | Updated At                 |
    +--------+--------------------------------------+-----------------------------+-----------------------------+----------------+--------------+--------------+-----------+--------------------------------------+------------+------------+----------------+----------------------------+----------------------------+
    | 138991 | 7d0f464b-2fea-49b0-87e1-596624ca03cc | None                        | None                        | compute-a2b1   | None         | None         | error     | 982c3ead-59b1-4acd-876b-d55166d8e7f0 | 982        | 982        | live-migration | 2024-01-09T09:56:45.000000 | 2024-01-09T09:56:50.000000 |
    | 138931 | e1eeef36-f4e9-4f2a-adc6-985413cea4ed | compute-a2b1.cld.domain.tld | compute-t2b2.cld.domain.tld | compute-a2b1   | compute-t2b2 | XXXXXXXXXXXX | error     | 982c3ead-59b1-4acd-876b-d55166d8e7f0 | 982        | 831        | resize         | 2024-01-08T14:11:13.000000 | 2024-01-08T14:11:14.000000 |
    | 138062 | c12a2ae0-969f-454f-b6d0-fa79381ba29f | compute-t2a3.cld.domain.tld | compute-a2b1.cld.domain.tld | compute-t2a3   | compute-a2b1 | XXXXXXXXXXXX | confirmed | 982c3ead-59b1-4acd-876b-d55166d8e7f0 | 831        | 982        | resize         | 2023-12-04T21:17:44.000000 | 2023-12-04T21:18:04.000000 |
    | 137966 | 12e3cffe-caeb-44ce-ac5a-baa0aa17d6e1 | compute-t2c1.cld.domain.tld | compute-t2a3.cld.domain.tld | compute-t2c1   | compute-t2a3 | None         | completed | 982c3ead-59b1-4acd-876b-d55166d8e7f0 | 831        | 831        | live-migration | 2023-12-02T16:14:13.000000 | 2023-12-02T16:35:19.000000 |
    | 137732 | 2ce96b71-143e-46cf-a7b2-822b2308f60a | compute-t2a3.cld.domain.tld | compute-t2c1.cld.domain.tld | compute-t2a3   | compute-t2c1 | None         | completed | 982c3ead-59b1-4acd-876b-d55166d8e7f0 | 831        | 831        | live-migration | 2023-12-01T09:23:32.000000 | 2023-12-01T09:24:13.000000 |
    | 137286 | 1c6c0e7d-cf33-4522-9710-e372392f3dad | compute-t2c3.cld.domain.tld | compute-t2a3.cld.domain.tld | compute-t2c3   | compute-t2a3 | None         | completed | 982c3ead-59b1-4acd-876b-d55166d8e7f0 | 831        | 831        | live-migration | 2023-11-29T23:42:59.000000 | 2023-11-30T00:34:06.000000 |
    | 137013 | 4afff1ec-08fd-4995-8642-b8341a169efb | compute-t2a3.cld.domain.tld | compute-t2c3.cld.domain.tld | compute-t2a3   | compute-t2c3 | None         | completed | 982c3ead-59b1-4acd-876b-d55166d8e7f0 | 831        | 831        | live-migration | 2023-11-28T09:54:53.000000 | 2023-11-28T09:55:35.000000 |
    | 135478 | cf28ff61-87d7-49d8-97b5-10382163d158 | compute-t2c3.cld.domain.tld | compute-t2a3.cld.domain.tld | compute-t2c3   | compute-t2a3 | None         | completed | 982c3ead-59b1-4acd-876b-d55166d8e7f0 | 831        | 831        | live-migration | 2023-11-09T06:48:44.000000 | 2023-11-09T06:52:46.000000 |
    | 135244 | 30f87d2e-ffc5-43ea-ae16-ade2c9a553b3 | compute-t2a3.cld.domain.tld | compute-t2c3.cld.domain.tld | compute-t2a3   | compute-t2c3 | None         | completed | 982c3ead-59b1-4acd-876b-d55166d8e7f0 | 831        | 831        | live-migration | 2023-11-08T20:49:01.000000 | 2023-11-08T20:49:47.000000 |
    +--------+--------------------------------------+-----------------------------+-----------------------------+----------------+--------------+--------------+-----------+--------------------------------------+------------+------------+----------------+----------------------------+----------------------------+

Yet even the live migration attempted later lists the correct flavor ID.
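
Failed resizes such as migration 138931 are a plausible starting point for narrowing down which instances to audit. The sketch below assumes the migration records have already been exported into simple Python objects. The `Migration` dataclass and `suspicious_resizes` are hypothetical names, and the filter only encodes the pattern seen in this report (a resize that ended in `error` while changing the flavor), which may not cover every way the data can drift.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Migration:
    id: int
    status: str
    old_flavor: int
    new_flavor: int
    type: str


def suspicious_resizes(migrations):
    """Return resize migrations that ended in error while changing the flavor."""
    return [
        m for m in migrations
        if m.type == "resize"
        and m.status == "error"
        and m.old_flavor != m.new_flavor
    ]


# The four most recent rows from the migration list above.
rows = [
    Migration(138991, "error", 982, 982, "live-migration"),
    Migration(138931, "error", 982, 831, "resize"),
    Migration(138062, "confirmed", 831, 982, "resize"),
    Migration(137966, "completed", 831, 831, "live-migration"),
]
print([m.id for m in suspicious_resizes(rows)])
```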

My concern is not so much the bug that leaves the data inconsistent, especially after failures. We know that this happens regularly with every OpenStack release, and we have had to fix the database many times before.

Our problem here is the complexity of fixing the inconsistencies, because most of the affected data is stored as serialized Python objects.
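
For illustration, the serialized objects in the dumps above are JSON documents, so a corrected `spec` value can at least be computed outside of Nova. The following is only a sketch of that idea and not a supported repair procedure: `replace_spec_flavor` is a hypothetical helper, it assumes that the `"cur"` flavor in `instance_extra` is the authoritative one, and it does not write anything to a database. Any result would need to be reviewed and tested on a copy of the data first.

```python
import json


def replace_spec_flavor(spec_json, extra_flavor_json):
    """Return RequestSpec JSON whose flavor is instance_extra's "cur" flavor."""
    spec = json.loads(spec_json)
    current = json.loads(extra_flavor_json)["cur"]
    if current is None:
        raise ValueError("instance_extra has no current flavor")
    # Both places store the flavor in the same serialized form, so the
    # object can be swapped as a whole; all other spec fields stay untouched.
    spec["nova_object.data"]["flavor"] = current
    return json.dumps(spec)


# Trimmed-down versions of the two columns shown earlier in this report.
broken_spec = json.dumps({
    "nova_object.name": "RequestSpec",
    "nova_object.data": {
        "id": 485269,
        "flavor": {"nova_object.name": "Flavor",
                   "nova_object.data": {"id": 831, "name": "e2.medium"}},
    },
})
extra_flavor = json.dumps({
    "cur": {"nova_object.name": "Flavor",
            "nova_object.data": {"id": 982, "name": "a2.medium"}},
    "old": None,
    "new": None,
})
patched_spec = json.loads(replace_spec_flavor(broken_spec, extra_flavor))
print(patched_spec["nova_object.data"]["flavor"]["nova_object.data"])
```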

Are there any tools or commands, automatic or manual, to check and fix these request spec data inconsistencies? Maybe something similar to the `nova-manage placement heal_allocations` command?

At the moment, I cannot even tell how many instances are affected. We also use scheduler filters to isolate users and projects between hosts, even where the hosts are technically compatible. Inconsistent data like this would therefore not cause a live migration to fail, but would silently violate our security and isolation boundaries. I would need to check each instance manually.

Steps to reproduce
==================

I cannot yet provide commands that produce this data inconsistency, but when it is introduced manually into a system, live migrations can fail because the scheduler makes wrong decisions.

Expected result
===============

All places where nova stores the flavor details should be kept in sync, or at least be fixable/resyncable on failures.

Actual result
=============

The scheduler runs with wrong data, violating scheduling constraints such as compatibility, security, and isolation boundaries. In the case of compatibility, other operations, such as live migrations, will fail. In other cases, no apparent error might occur.

Environment
===========

1. Exact version of OpenStack you are running. See the following
   list for all releases: http://docs.openstack.org/releases/

    ii nova-common 3:25.2.1-0ubuntu1 all OpenStack Compute - common files
    ii nova-conductor 3:25.2.1-0ubuntu1 all OpenStack Compute - conductor service
    ii nova-scheduler 3:25.2.1-0ubuntu1 all OpenStack Compute - virtual machine scheduler
    ii nova-spiceproxy 3:25.2.1-0ubuntu1 all OpenStack Compute - spice html5 proxy
    ii python3-nova 3:25.2.1-0ubuntu1 all OpenStack Compute Python 3 libraries

2. Which hypervisor did you use?
   Libvirt + KVM

3. Which storage type did you use?
   Ceph 17.2.7-1focal, local qcow2 disks

4. Which networking type did you use?
   Neutron ML2/LXB

Logs & Configs
==============

The tool *sosreport* has support for some OpenStack projects.
It's worth having a look at it. For example, if you want to collect
the logs of a compute node you would execute:

   $ sudo sosreport -o openstack_nova --batch

on that compute node. Attach the logs to this bug report. Please
consider that these logs need to be collected in "DEBUG" mode.

Jan Graichen (jgraichen)
description: updated