instance_extra corrupts on N-1 cellsv2 upgrade

Bug #2022967 reported by Bjoern
Affects                   Status  Importance  Assigned to  Milestone
OpenStack Compute (nova)  New     Undecided   Unassigned

Bug Description

We upgraded a large cellsv2 deployment from Train (nova 20.6.1) to Ussuri (nova 21.2.5.dev27). The cell0 control plane has been upgraded,
and the cell controllers are all on that same nova version.
Only the nova-compute nodes were left running the prior version so that we could upgrade them cell by cell.

However, we now see nova-conductor reporting errors like the following:

ERROR nova.compute.manager [req-967855b9-6938-4ca0-b7b9-dcf0f5af9402 - - - - -] Error updating resources for node sc9-1-hv329: oslo_messaging.rpc.client.RemoteError: Remote error: JSONDecodeError Expecting value: line 1 column 1 (char 0)
Jun 05 13:02:36 sc9-1-hv329 nova-compute[40856]: ['Traceback (most recent call last):\n', ' File "/openstack/venvs/nova-21.2.13.dev6/lib/python3.6/site-packages/nova/conductor/manager.py", line 139, in _object_dispatch\n return getattr(target, method)(*args, **kwargs)\n', ' File "/openstack/venvs/nova-21.2.13.dev6/lib/python3.6/site-packages/oslo_versionedobjects/base.py", line 184, in wrapper\n result = fn(cls, context, *args, **kwargs)\n', ' File "/openstack/venvs/nova-21.2.13.dev6/lib/python3.6/site-packages/nova/objects/instance.py", line 1333, in get_by_host_and_node\n expected_attrs)\n', ' File "/openstack/venvs/nova-21.2.13.dev6/lib/python3.6/site-packages/nova/objects/instance.py", line 1238, in _make_instance_list\n expected_attrs=expected_attrs)\n', ' File "/openstack/venvs/nova-21.2.13.dev6/lib/python3.6/site-packages/nova/objects/instance.py", line 441, in _from_db_object\n expected_attrs)\n', ' File "/openstack/venvs/nova-21.2.13.dev6/lib/python3.6/site-packages/nova/objects/instance.py", line 502, in _extra_attributes_from_db_object\n db_inst[\'extra\'].get(\'resources\'))\n', ' File "/openstack/venvs/nova-21.2.13.dev6/lib/python3.6/site-packages/nova/objects/instance.py", line 1025, in _load_resources\n jsonutils.loads(db_resources))\n', ' File "/openstack/venvs/nova-21.2.13.dev6/lib/python3.6/site-packages/oslo_serialization/jsonutils.py", line 249, in loads\n return json.loads(encodeutils.safe_decode(s, encoding), **kwargs)\n', ' File "/usr/lib/python3.6/json/__init__.py", line 354, in loads\n return _default_decoder.decode(s)\n', ' File "/usr/lib/python3.6/json/decoder.py", line 339, in decode\n obj, end = self.raw_decode(s, idx=_w(s, 0).end())\n', ' File "/usr/lib/python3.6/json/decoder.py", line 357, in raw_decode\n raise JSONDecodeError("Expecting value", s, err.value) from None\n', 'json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)\n'].

This error now prevents nova-compute from starting instances once they have been stopped.
So far we have tracked it down to corruption of the nova.instance_extra table at the individual cell level, visible when comparing a row before and after the upgrade.
The corruption seems to begin in the keypairs column and continue through the columns that follow it, which suggests a shift in a Python class/structure.
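To gauge how many rows are affected, one option is to check whether each JSON-bearing column of an instance_extra row still parses. The following is only a minimal sketch: the column names are taken from the table dumps in this report, while the function name find_corrupt_columns and the idea of passing each row as a plain dict (for example from a dictionary cursor) are assumptions for illustration, not part of nova.

```python
import json

# Columns of nova.instance_extra that hold serialized JSON, as seen in the
# dumps in this report. NULL values are legitimate and are skipped.
JSON_COLUMNS = (
    "numa_topology", "pci_requests", "flavor", "vcpu_model",
    "migration_context", "keypairs", "device_metadata",
    "trusted_certs", "vpmems", "resources",
)


def find_corrupt_columns(row):
    """Return the JSON columns of one row (a dict) that fail to parse.

    Hypothetical helper for triage only; it does not touch the database.
    """
    corrupt = []
    for col in JSON_COLUMNS:
        value = row.get(col)
        if value is None:  # SQL NULL
            continue
        if isinstance(value, bytes):
            value = value.decode("utf-8", errors="replace")
        try:
            json.loads(value)
        except ValueError:  # json.JSONDecodeError is a ValueError subclass
            corrupt.append(col)
    return corrupt


# Shortened stand-ins for a healthy and a damaged row.
healthy = {"pci_requests": "[]", "keypairs": '{"objects": []}', "resources": None}
damaged = {"pci_requests": "[]", "keypairs": '{"objects": [1]',
           "vpmems": "+\x82garbage", "resources": 'nceNUMATopology"'}

print(find_corrupt_columns(healthy))
print(find_corrupt_columns(damaged))
```

A check like this only shows that a column is no longer valid JSON; it says nothing about which service wrote the bad value.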

Pre Upgrade

MariaDB [(none)]> select * from nova.instance_extra where instance_uuid = 'bd3ad637-3291-454b-95e3-d498ce0f81bd'\G
*************************** 1. row ***************************
       created_at: 2023-06-02 20:42:11
       updated_at: 2023-06-02 20:43:48
       deleted_at: NULL
          deleted: 0
               id: 260958
    instance_uuid: bd3ad637-3291-454b-95e3-d498ce0f81bd
    numa_topology: {"nova_object.name": "InstanceNUMATopology", "nova_object.namespace": "nova", "nova_object.version": "1.3", "nova_object.data": {"cells": [{"nova_object.name": "InstanceNUMACell", "nova_object.namespace": "nova", "nova_object.version": "1.4", "nova_object.data": {"id": 0, "cpuset": [0], "memory": 512, "pagesize": null, "cpu_topology": {"nova_object.name": "VirtCPUTopology", "nova_object.namespace": "nova", "nova_object.version": "1.0", "nova_object.data": {"sockets": 1, "cores": 1, "threads": 1}, "nova_object.changes": ["threads", "cores", "sockets"]}, "cpu_pinning_raw": {"0": 17}, "cpu_policy": "dedicated", "cpu_thread_policy": "prefer", "cpuset_reserved": null}, "nova_object.changes": ["cpu_topology", "cpuset_reserved", "cpu_pinning_raw", "id", "pagesize"]}], "emulator_threads_policy": null}, "nova_object.changes": ["emulator_threads_policy", "cells"]}
     pci_requests: []
           flavor: {"cur": {"nova_object.name": "Flavor", "nova_object.namespace": "nova", "nova_object.version": "1.2", "nova_object.data": {"id": 611, "name": "g.tiny.single_core", "memory_mb": 512, "vcpus": 1, "root_gb": 1, "ephemeral_gb": 0, "flavorid": "d3c79ed9-a6d8-4fb9-88a0-c70739c90c36", "swap": 0, "rxtx_factor": 1.0, "vcpu_weight": 0, "disabled": false, "is_public": true, "extra_specs": {"aggregate_instance_extra_specs:selection": "batch.x86_64.2x2", "hw:cpu_policy": "dedicated", "hw:cpu_threads": "1", "hw:cpu_thread_policy": "prefer"}, "description": null, "created_at": "2021-04-27T21:26:16Z", "updated_at": null, "deleted_at": null, "deleted": false}, "nova_object.changes": ["extra_specs"]}, "old": null, "new": null}
       vcpu_model: {"nova_object.name": "VirtCPUModel", "nova_object.namespace": "nova", "nova_object.version": "1.0", "nova_object.data": {"arch": null, "vendor": null, "topology": {"nova_object.name": "VirtCPUTopology", "nova_object.namespace": "nova", "nova_object.version": "1.0", "nova_object.data": {"sockets": 1, "cores": 1, "threads": 1}, "nova_object.changes": ["threads", "cores", "sockets"]}, "features": [], "mode": "host-passthrough", "model": null, "match": "exact"}, "nova_object.changes": ["features", "vendor", "topology", "model", "mode", "match", "arch"]}
migration_context: NULL
         keypairs: {"nova_object.name": "KeyPairList", "nova_object.namespace": "nova", "nova_object.version": "1.3", "nova_object.data": {"objects": [{"nova_object.name": "KeyPair", "nova_object.namespace": "nova", "nova_object.version": "1.4", "nova_object.data": {"id": 10, "name": "rpc_support", "user_id": "5f1bf3f91c2d4ab7b46c13441dc0952f", "fingerprint": "c5:79:2a:70:6a:f9:32:65:16:39:d4:45:9f:d1:86:21", "public_key": "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCelh7W66McTxWeCM+eqRlxtRse8sTHLA+6vzmeNX4b+dyVwvuhVFt4xd0nr42CVx8pz7dfZUgeVUSLoURFvvTNPpt2TTn1gITFHUga0hwiGkVWtpz0y4pVzyNDQZUUMHbuLGU+E++8RHVkxTplhclTD57+fhGZdu8VV1Rh8ZL+UStKqlY1YUDP1NubJ8kMhUbllYXeCa3pC5L+vA0svHVe/Or1hV2Ls7xtYVFdlgrwKmJ8lNi4yJZOW02f/b3YcsFTjAe+ic2RK2HGhDOGxD11ALBFT8SF419mMq+m14eXiOfG6jbavzWCrMBGXTi/gwBqRHslNAqpu7TcsvCyIIP7 root@mcp-ctrl01\n", "type": "ssh", "created_at": "2019-11-13T02:10:53Z", "updated_at": null, "deleted_at": null, "deleted": false}}]}}
  device_metadata: NULL
    trusted_certs: NULL
           vpmems: NULL
        resources: NULL

Post Cell0 and cell controller upgrade:

After the instance is stopped, its instance_extra row is corrupted (the keypairs column and those following it), and the instance can no longer be started unless the row is restored to its previous state.
This is after the cell controller upgrade, with nova-compute still running Train; restarting the service does not change the situation.

MariaDB [(none)]> select * from nova.instance_extra where instance_uuid = 'bd3ad637-3291-454b-95e3-d498ce0f81bd'\G
*************************** 1. row ***************************
       created_at: 2023-06-02 20:42:11
       updated_at: 2023-06-05 17:19:51
       deleted_at: NULL
          deleted: 0
               id: 260958
    instance_uuid: bd3ad637-3291-454b-95e3-d498ce0f81bd
    numa_topology: {"nova_object.name": "InstanceNUMATopology", "nova_object.namespace": "nova", "nova_object.version": "1.3", "nova_object.data": {"cells": [{"nova_object.name": "InstanceNUMACell", "nova_object.namespace": "nova", "nova_object.version": "1.4", "nova_object.data": {"id": 0, "cpuset": [0], "memory": 512, "pagesize": null, "cpu_topology": {"nova_object.name": "VirtCPUTopology", "nova_object.namespace": "nova", "nova_object.version": "1.0", "nova_object.data": {"sockets": 1, "cores": 1, "threads": 1}, "nova_object.changes": ["threads", "cores", "sockets"]}, "cpu_pinning_raw": {"0": 17}, "cpu_policy": "dedicated", "cpu_thread_policy": "prefer", "cpuset_reserved": null}, "nova_object.changes": ["cpu_topology", "cpuset_reserved", "cpu_pinning_raw", "id", "pagesize"]}], "emulator_threads_policy": null}, "nova_object.changes": ["emulator_threads_policy", "cells"]}
     pci_requests: []
           flavor: {"cur": {"nova_object.name": "Flavor", "nova_object.namespace": "nova", "nova_object.version": "1.2", "nova_object.data": {"id": 611, "name": "g.tiny.single_core", "memory_mb": 512, "vcpus": 1, "root_gb": 1, "ephemeral_gb": 0, "flavorid": "d3c79ed9-a6d8-4fb9-88a0-c70739c90c36", "swap": 0, "rxtx_factor": 1.0, "vcpu_weight": 0, "disabled": false, "is_public": true, "extra_specs": {"aggregate_instance_extra_specs:selection": "batch.x86_64.2x2", "hw:cpu_policy": "dedicated", "hw:cpu_threads": "1", "hw:cpu_thread_policy": "prefer"}, "description": null, "created_at": "2021-04-27T21:26:16Z", "updated_at": null, "deleted_at": null, "deleted": false}, "nova_object.changes": ["extra_specs"]}, "old": null, "new": null}
       vcpu_model: {"nova_object.name": "VirtCPUModel", "nova_object.namespace": "nova", "nova_object.version": "1.0", "nova_object.data": {"arch": null, "vendor": null, "topology": {"nova_object.name": "VirtCPUTopology", "nova_object.namespace": "nova", "nova_object.version": "1.0", "nova_object.data": {"sockets": 1, "cores": 1, "threads": 1}, "nova_object.changes": ["threads", "cores", "sockets"]}, "features": [], "mode": "host-passthrough", "model": null, "match": "exact"}, "nova_object.changes": ["features", "vendor", "topology", "model", "mode", "match", "arch"]}
migration_context: NULL
         keypairs: {"nova_object.name": "KeyPairList", "nova_object.namespace": "nova", "nova_object.version": "1.3", "nova_object.data": {"objects": [{"nova_object.name": "KeyPair", "nova_object.namespace": "nova", "nova_object.version": "1.4", "nova_object.data": {"id": 10, "name": "rpc_support", "user_id": "5f1bf3f91c2d4ab7b46c13441dc0952f", "fingerprint": "c5:79:2a:70:6a:f9:32:65:16:39:d4:45:9f:d1:86:21", "public_key": "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCelh7W66McTxWeCM+eqRlxtRse8sTHLA+6vzmeNX4b+dyVwvuhVFt4xd0nr42CVx8pz7dfZUgeVUSLoURFvvTNPpt2TTn1gITFHUga0hwiGkVWtpz0y4pVzyNDQZUUMHbuLGU+E++8RHVkxTplhclTD57+fhGZdu8VV1Rh8ZL+UStKqlY1YUDP1NubJ8kMhUbllYXeCa3pC5L+vA0svHVe/Or1hV2Ls7xtYVFdlgrwKmJ8lNi4yJZOW02f/b3YcsFTjAe+ic2RK2HGhDOGxD11ALBFT8SF419mMq+m14eXiOfG6jbavzWCrMBGXTi/gwBqRHslNAqpu7TcsvCyIIP7 root@mcp-ctrl01\n", "type": "ssh", "created_at": "2019-11-13T02:10:53Z", "updated_at": null, "deleted_at": null, "deleted": false}}]¤ƒ
  device_metadata: NULL
    trusted_certs: NULL
           vpmems: +‚΂bƒ$=
                            W€ûd € ™°EJ‹™°Jô¨€ c6ed9384-dc62-4c88-b9f4-fb3eee03b025{"nova_object.name": "Insta
        resources: nceNUMATopology", "nova_object.namespace": "nova", "nova_object.version": "1.3", "nova_object.data": {"cells": [{"nov
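
The resources value in the dump above is consistent with the traceback, which fails in _load_resources while decoding that column. As a small standalone check (plain json from the standard library rather than oslo_serialization, and a shortened copy of the fragment; the helper name decode_error_message is made up for this sketch), parsing a string that starts mid-token gives the same message:

```python
import json

# Shortened copy of the post-upgrade "resources" value: the tail of a JSON
# document that starts in the middle of the string "InstanceNUMATopology".
fragment = 'nceNUMATopology", "nova_object.namespace": "nova"'


def decode_error_message(text):
    """Return the JSONDecodeError message for text, or None if it parses."""
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        return str(exc)
    return None


print(decode_error_message(fragment))
# -> Expecting value: line 1 column 1 (char 0)
```

This only confirms that the error is a symptom of the shifted column contents; it does not identify what wrote them.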

At this point we are accelerating the nova-compute upgrades to see whether that fixes it.
If it does, then N-1 compatibility is not working for a cellsv2 deployment.
So far we have not found the cause in the code and would appreciate feedback on where to look.
