Booting instance with pci_device fails during rocky->stein live upgrade

Bug #1868033 reported by Sam Morrison
Affects                   Status        Importance  Assigned to  Milestone
OpenStack Compute (nova)  Fix Released  High        Dan Smith
Stein                     Fix Released  Undecided   Unassigned
Train                     Fix Released  Undecided   Unassigned
Ussuri                    Fix Released  Undecided   Unassigned

Bug Description

Environment:

Stein nova-conductor with upgrade_levels set to rocky
Rocky nova-compute

Boot an instance with a flavour that has a pci_device

Error:

Failed to publish message to topic 'nova': maximum recursion depth exceeded: RuntimeError: maximum recursion depth exceeded

Tracked this down to it continually trying to backport the InstancePCIRequests:

It gets as arguments:
objinst={u'nova_object.version': u'1.1', u'nova_object.name': u'InstancePCIRequests', u'nova_object.data': {u'instance_uuid': u'08212b12-8fa8-42d9-8d3e-52ed60a64135', u'requests': [{u'nova_object.version': u'1.3', u'nova_object.name': u'InstancePCIRequest', u'nova_object.data': {u'count': 1, u'is_new': False, u'numa_policy': None, u'request_id': None, u'requester_id': None, u'alias_name': u'V100-32G', u'spec': [{u'vendor_id': u'10de', u'product_id': u'1db6'}]}, u'nova_object.namespace': u'nova'}]}, u'nova_object.namespace': u'nova'},

object_versions={u'InstancePCIRequests': '1.1', 'InstancePCIRequest': '1.2'}

It fails because it doesn't backport the individual InstancePCIRequest inside the InstancePCIRequests object and so keeps trying.

The error it shows is: IncompatibleObjectVersion: Version 1.3 of InstancePCIRequest is not supported, supported version is 1.2

I have fixed this in our setup by altering obj_make_compatible to downgrade the individual requests to version 1.2, which seems to work.
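
To illustrate the failure mode described above, here is a minimal sketch using plain dicts rather than real nova objects. The helper names (`version_tuple`, `find_too_new`, `backport_parent_only`) are hypothetical and not part of nova, the payload is a trimmed copy of the one quoted in this report, and the bounded loop only stands in for the repeated backport attempts that ended in the recursion error.

```python
def version_tuple(version):
    """Turn '1.3' into (1, 3) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def find_too_new(primitive, supported):
    """Walk a serialized object tree and return (name, version) pairs
    that are newer than the versions the receiver says it supports."""
    found = []
    if isinstance(primitive, dict):
        name = primitive.get("nova_object.name")
        version = primitive.get("nova_object.version")
        if name in supported:
            if version_tuple(version) > version_tuple(supported[name]):
                found.append((name, version))
        for value in primitive.values():
            found.extend(find_too_new(value, supported))
    elif isinstance(primitive, list):
        for item in primitive:
            found.extend(find_too_new(item, supported))
    return found


def backport_parent_only(primitive, supported):
    """Toy stand-in for the buggy behaviour: only the outer object's
    version is adjusted; nested objects are left untouched."""
    result = dict(primitive)
    result["nova_object.version"] = supported[primitive["nova_object.name"]]
    return result


# Trimmed copy of the payload and manifest quoted in the report.
objinst = {
    "nova_object.name": "InstancePCIRequests",
    "nova_object.version": "1.1",
    "nova_object.data": {
        "requests": [
            {
                "nova_object.name": "InstancePCIRequest",
                "nova_object.version": "1.3",
                "nova_object.data": {"count": 1, "alias_name": "V100-32G"},
            }
        ]
    },
}
object_versions = {"InstancePCIRequests": "1.1", "InstancePCIRequest": "1.2"}

# Each "backport" leaves the child at 1.3, so the receiver would reject
# the result and ask again. The cap of 5 is only there to end the demo.
attempts = 0
payload = objinst
while find_too_new(payload, object_versions) and attempts < 5:
    payload = backport_parent_only(payload, object_versions)
    attempts += 1

print(attempts, find_too_new(payload, object_versions))
# prints: 5 [('InstancePCIRequest', '1.3')]
```

No amount of retrying helps, because the nested InstancePCIRequest never changes version.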

Revision history for this message
Sam Morrison (sorrison) wrote :

For reference, this is the hotfix we've deployed in production:

https://github.com/NeCTAR-RC/nova/commit/cbb2921375df6cfc33273ab84c29b8b309885c04

Changed in nova:
importance: Undecided → High
tags: added: upgrade
tags: added: pci
Changed in nova:
status: New → Confirmed
assignee: nobody → Stephen Finucane (stephenfinucane)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/721667

Changed in nova:
assignee: Stephen Finucane (stephenfinucane) → Dan Smith (danms)
status: Confirmed → In Progress
Revision history for this message
Dan Smith (danms) wrote :

Sam, could you test this patch in place of yours for us? It'll be difficult for us to test something as old as Stein with PCI.

Thanks!

Revision history for this message
Sam Morrison (sorrison) wrote :

Hi Dan,

We are now fully on Stein, so sadly we're not in a position to test this easily either.
From the looks of the code, it looks like the way to go.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/721667
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=d3ca7356860d64555eef6f5138501cb38f50ecc8
Submitter: Zuul
Branch: master

commit d3ca7356860d64555eef6f5138501cb38f50ecc8
Author: Dan Smith <email address hidden>
Date: Tue Apr 21 09:07:32 2020 -0700

    Remove stale nested backport from InstancePCIRequests

    Sometime in 2015, we removed the hard-coded obj_relationships mapping
    from parent objects which facilitated semi-automated child version
    backports. This was replaced by a manifest-of-versions mechanism
    where the client reports all the supported objects and versions
    during a backport request to conductor. The InstancePCIRequests object
    isn't technically an ObjectListBase, despite acting like one, and thus
    wasn't using the obj_relationships. Because of this, it was doing
    its own backporting of its child object, which was not removed in
    the culling of the static mechanism. Because we now no longer need to
    worry about sub-object backport chaining, when version 1.2 was added,
    no backport rule was added, and since the object does not call the
    base class' generic routine, proper backporting of the child object
    was not happening.

    All we need to do is remove the override to allow the base
    infrastructure to do the work.

    Change-Id: Id610a24c066707de5ddc0507e7ef26c421ba366c
    Closes-Bug: #1868033
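
The manifest-of-versions idea in the commit message above can be sketched with plain dicts. This is a toy model, not the oslo.versionedobjects implementation: `backport_from_manifest` and the `FIELDS_ADDED` table are hypothetical names for this example, and treating `requester_id` as the field introduced in InstancePCIRequest 1.3 is an assumption.

```python
# Hypothetical table for this sketch: fields introduced at a given
# version, to be dropped when downgrading below it. That requester_id
# arrived in InstancePCIRequest 1.3 is an assumption.
FIELDS_ADDED = {
    "InstancePCIRequest": {(1, 3): ["requester_id"]},
}


def version_tuple(version):
    """Turn '1.3' into (1, 3) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def backport_from_manifest(primitive, manifest):
    """Return a copy of a serialized object tree in which every nested
    object is downgraded to the version listed in the manifest."""
    if isinstance(primitive, list):
        return [backport_from_manifest(item, manifest) for item in primitive]
    if not isinstance(primitive, dict):
        return primitive
    # Children are handled first, so nested objects need no special
    # handling in the parent.
    result = {
        key: backport_from_manifest(value, manifest)
        for key, value in primitive.items()
    }
    name = result.get("nova_object.name")
    if name in manifest:
        target = version_tuple(manifest[name])
        if version_tuple(result["nova_object.version"]) > target:
            data = result["nova_object.data"]
            for added_in, fields in FIELDS_ADDED.get(name, {}).items():
                if added_in > target:
                    for field in fields:
                        data.pop(field, None)
            result["nova_object.version"] = manifest[name]
    return result


parent = {
    "nova_object.name": "InstancePCIRequests",
    "nova_object.version": "1.1",
    "nova_object.data": {
        "requests": [
            {
                "nova_object.name": "InstancePCIRequest",
                "nova_object.version": "1.3",
                "nova_object.data": {"count": 1, "requester_id": None},
            }
        ]
    },
}
manifest = {"InstancePCIRequests": "1.1", "InstancePCIRequest": "1.2"}

out = backport_from_manifest(parent, manifest)
child = out["nova_object.data"]["requests"][0]
print(child["nova_object.version"], sorted(child["nova_object.data"]))
```

Because the walk is driven by the manifest rather than by per-parent rules, the parent object needs no knowledge of its child's version history, which is the point the commit message makes about removing the override.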

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/725931

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/725932

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/ussuri)

Reviewed: https://review.opendev.org/725931
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=e61d0025303b33ef00aa95ebd934f6121d320cbb
Submitter: Zuul
Branch: stable/ussuri

commit e61d0025303b33ef00aa95ebd934f6121d320cbb
Author: Dan Smith <email address hidden>
Date: Tue Apr 21 09:07:32 2020 -0700

    Remove stale nested backport from InstancePCIRequests

    Change-Id: Id610a24c066707de5ddc0507e7ef26c421ba366c
    Closes-Bug: #1868033
    (cherry picked from commit d3ca7356860d64555eef6f5138501cb38f50ecc8)

tags: added: in-stable-ussuri
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/738199

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/train)

Reviewed: https://review.opendev.org/725932
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=38ee1f39423e3af12ddc04de6808ff86bfbca645
Submitter: Zuul
Branch: stable/train

commit 38ee1f39423e3af12ddc04de6808ff86bfbca645
Author: Dan Smith <email address hidden>
Date: Tue Apr 21 09:07:32 2020 -0700

    Remove stale nested backport from InstancePCIRequests

    Change-Id: Id610a24c066707de5ddc0507e7ef26c421ba366c
    Closes-Bug: #1868033
    (cherry picked from commit d3ca7356860d64555eef6f5138501cb38f50ecc8)
    (cherry picked from commit e61d0025303b33ef00aa95ebd934f6121d320cbb)

tags: added: in-stable-train
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/stein)

Reviewed: https://review.opendev.org/738199
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=eaecf7e737db3f9e8e21f5d6ba70dd20028bb697
Submitter: Zuul
Branch: stable/stein

commit eaecf7e737db3f9e8e21f5d6ba70dd20028bb697
Author: Dan Smith <email address hidden>
Date: Tue Apr 21 09:07:32 2020 -0700

    Remove stale nested backport from InstancePCIRequests

    Change-Id: Id610a24c066707de5ddc0507e7ef26c421ba366c
    Closes-Bug: #1868033
    (cherry picked from commit d3ca7356860d64555eef6f5138501cb38f50ecc8)
    (cherry picked from commit e61d0025303b33ef00aa95ebd934f6121d320cbb)
    (cherry picked from commit 38ee1f39423e3af12ddc04de6808ff86bfbca645)

tags: added: in-stable-stein