[QoS min bw] repeated ERROR log: Unable to save resource provider ... because: re-parenting a provider is not currently allowed

Bug #1921150 reported by Balazs Gibizer
Affects: neutron
Status: Fix Released
Importance: High
Assigned to: Bence Romsics

Bug Description

Description
===========
If neutron is configured with QoS guaranteed minimum bandwidth and the deployment is upgraded from Stein 14.0.4 or older, or from Train 15.0.1 or older, to any newer OpenStack version, the following stack trace appears repeatedly in the neutron-server log:

Mar 24 12:12:36 ubuntu neutron-server[4499]: ERROR neutron.services.placement_report.plugin Traceback (most recent call last):
Mar 24 12:12:36 ubuntu neutron-server[4499]: ERROR neutron.services.placement_report.plugin File "/opt/stack/neutron-lib/neutron_lib/placement/client.py", line 53, in wrapper
Mar 24 12:12:36 ubuntu neutron-server[4499]: ERROR neutron.services.placement_report.plugin return f(self, *a, **k)
Mar 24 12:12:36 ubuntu neutron-server[4499]: ERROR neutron.services.placement_report.plugin File "/opt/stack/neutron-lib/neutron_lib/placement/client.py", line 232, in update_resource_provider
Mar 24 12:12:36 ubuntu neutron-server[4499]: ERROR neutron.services.placement_report.plugin return self._put(url, update_body).json()
Mar 24 12:12:36 ubuntu neutron-server[4499]: ERROR neutron.services.placement_report.plugin File "/opt/stack/neutron-lib/neutron_lib/placement/client.py", line 188, in _put
Mar 24 12:12:36 ubuntu neutron-server[4499]: ERROR neutron.services.placement_report.plugin endpoint_filter=self._ks_filter, **kwargs)
Mar 24 12:12:36 ubuntu neutron-server[4499]: ERROR neutron.services.placement_report.plugin File "/usr/local/lib/python3.6/dist-packages/keystoneauth1/session.py", line 1114, in put
Mar 24 12:12:36 ubuntu neutron-server[4499]: ERROR neutron.services.placement_report.plugin return self.request(url, 'PUT', **kwargs)
Mar 24 12:12:36 ubuntu neutron-server[4499]: ERROR neutron.services.placement_report.plugin File "/usr/local/lib/python3.6/dist-packages/keystoneauth1/session.py", line 943, in request
Mar 24 12:12:36 ubuntu neutron-server[4499]: ERROR neutron.services.placement_report.plugin raise exceptions.from_response(resp, method, url)
Mar 24 12:12:36 ubuntu neutron-server[4499]: ERROR neutron.services.placement_report.plugin keystoneauth1.exceptions.http.BadRequest: Bad Request (HTTP 400) (Request-ID: req-31ef5696-dc60-4478-939b-a12d3d3bdf65)
Mar 24 12:12:36 ubuntu neutron-server[4499]: ERROR neutron.services.placement_report.plugin
Mar 24 12:12:36 ubuntu neutron-server[4499]: ERROR neutron.services.placement_report.plugin During handling of the above exception, another exception occurred:
Mar 24 12:12:36 ubuntu neutron-server[4499]: ERROR neutron.services.placement_report.plugin
Mar 24 12:12:36 ubuntu neutron-server[4499]: ERROR neutron.services.placement_report.plugin Traceback (most recent call last):
Mar 24 12:12:36 ubuntu neutron-server[4499]: ERROR neutron.services.placement_report.plugin File "/opt/stack/neutron/neutron/services/placement_report/plugin.py", line 163, in batch
Mar 24 12:12:36 ubuntu neutron-server[4499]: ERROR neutron.services.placement_report.plugin deferred.execute()
Mar 24 12:12:36 ubuntu neutron-server[4499]: ERROR neutron.services.placement_report.plugin File "/opt/stack/neutron/neutron/agent/common/placement_report.py", line 43, in execute
Mar 24 12:12:36 ubuntu neutron-server[4499]: ERROR neutron.services.placement_report.plugin return self.func(*self.args, **self.kwargs)
Mar 24 12:12:36 ubuntu neutron-server[4499]: ERROR neutron.services.placement_report.plugin File "/opt/stack/neutron-lib/neutron_lib/placement/client.py", line 53, in wrapper
Mar 24 12:12:36 ubuntu neutron-server[4499]: ERROR neutron.services.placement_report.plugin return f(self, *a, **k)
Mar 24 12:12:36 ubuntu neutron-server[4499]: ERROR neutron.services.placement_report.plugin File "/opt/stack/neutron-lib/neutron_lib/placement/client.py", line 254, in ensure_resource_provider
Mar 24 12:12:36 ubuntu neutron-server[4499]: ERROR neutron.services.placement_report.plugin resource_provider=resource_provider)
Mar 24 12:12:36 ubuntu neutron-server[4499]: ERROR neutron.services.placement_report.plugin File "/opt/stack/neutron-lib/neutron_lib/placement/client.py", line 62, in wrapper
Mar 24 12:12:36 ubuntu neutron-server[4499]: ERROR neutron.services.placement_report.plugin msg=exc.response.text.replace('\n', ' '))
Mar 24 12:12:36 ubuntu neutron-server[4499]: ERROR neutron.services.placement_report.plugin neutron_lib.exceptions.placement.PlacementClientError: Placement Client Error (4xx): {"errors": [{"status": 400, "title": "Bad Request", "detail": "The server could not comply with the request since it is either malformed or otherwise incorrect.\n\n Unable to save resource provider af0bc0aa-525e-563f-bb4d-2f26f70371d6: Object action update failed because: re-parenting a provider is not currently allowed. ", "request_id": "req-31ef5696-dc60-4478-939b-a12d3d3bdf65"}]}
Mar 24 12:12:36 ubuntu neutron-server[4499]: ERROR neutron.services.placement_report.plugin
Mar 24 12:12:36 ubuntu neutron-server[4499]: WARNING neutron.services.placement_report.plugin [-] Synchronization of resources of agent type Open vSwitch agent at host ubuntu to placement failed.
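
The last ERROR line of the traceback carries a JSON payload that names the resource provider whose update was rejected. As a minimal sketch (the helper name and the regular expression are illustrative assumptions, not part of neutron), the affected RP UUIDs could be pulled out of such a payload like this:

```python
import json
import re
from typing import List

# Payload copied from the PlacementClientError line in the log above.
PAYLOAD = (
    '{"errors": [{"status": 400, "title": "Bad Request", "detail": '
    '"The server could not comply with the request since it is either '
    'malformed or otherwise incorrect.\\n\\n Unable to save resource provider '
    'af0bc0aa-525e-563f-bb4d-2f26f70371d6: Object action update failed '
    'because: re-parenting a provider is not currently allowed. ", '
    '"request_id": "req-31ef5696-dc60-4478-939b-a12d3d3bdf65"}]}'
)

# Hypothetical pattern: matches only the re-parenting failure message.
REPARENT_RE = re.compile(
    r"Unable to save resource provider "
    r"([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}): "
    r".*re-parenting a provider is not currently allowed"
)


def reparent_failures(payload: str) -> List[str]:
    """Return the UUIDs of providers whose update failed on re-parenting."""
    uuids = []
    for error in json.loads(payload).get("errors", []):
        match = REPARENT_RE.search(error.get("detail", ""))
        if match:
            uuids.append(match.group(1))
    return uuids


if __name__ == "__main__":
    print(reparent_failures(PAYLOAD))
```

The UUIDs found this way are the device RPs that neutron keeps trying, and failing, to move.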

Steps to reproduce
==================
1) Deploy neutron Stein 14.0.4 or older or Train 15.0.1 or older
2) Configure minimum guaranteed bandwidth according to [1], e.g. define bandwidth inventory in the ovs or sriov agent config with the [ovs] or [sriov_nic]/resource_provider_bandwidths config option
3) Observe that the agent RP and the device RPs are created

    $ openstack --os-placement-api-version 1.14 resource provider tree list

and that the parent of the device RP is the agent RP.

4) Upgrade to a newer OpenStack version
5) Observe that the above periodic error appears in the neutron-server log.

Expected result
===============

No error logs

Actual result
=============

Repeated error logs appear

Triage
======

The problem is caused by the bugfix [2] merged in Ussuri and backported to stable/train and stable/stein.

Before patch [2] the RP tree in placement was created with the following structure:

  computeRP
    \- agent_1_RP
    | \- device_1_RP
    | \- device_2_RP
    |
    \- agent_2_RP

That is, the parent of the deviceRP is the agentRP that has the given device configured.

However, after patch [2] neutron tries to create a tree where the parent of the deviceRP is the computeRP:

  computeRP
    \- agent_1_RP
    \- agent_2_RP
    \- device_1_RP
    \- device_2_RP

If the deviceRP already exists under the agentRP before the upgrade, then after the upgrade neutron tries to update the parent of the deviceRP to point to the computeRP. However, the Placement API does not allow such re-parenting of an RP, hence the periodic ERROR message in the neutron-server log.

If a new device is added to the ovs or sriov agent config after the upgrade, then neutron-server successfully creates the deviceRP under the computeRP and therefore no repeated ERROR log appears.

Changing the structure of the RP tree was a mistake in [2]. The correct and intended structure is where the deviceRP is under the agentRP.
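
To make the difference between the two trees concrete, here is a small sketch that models both layouts in memory and lists the device RPs that ended up with the wrong parent. The Provider class and the "kind" labels are assumptions made for this illustration only; Placement itself does not store such a field.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Provider:
    name: str
    kind: str  # "compute", "agent" or "device"; labels used only in this sketch
    parent: Optional[str] = None  # name of the parent RP, None for the root


# Intended tree (as created before patch [2]): devices under their agent.
INTENDED = [
    Provider("computeRP", "compute"),
    Provider("agent_1_RP", "agent", "computeRP"),
    Provider("device_1_RP", "device", "agent_1_RP"),
    Provider("device_2_RP", "device", "agent_1_RP"),
    Provider("agent_2_RP", "agent", "computeRP"),
]

# Tree created after patch [2]: devices directly under the compute RP.
FLATTENED = [
    Provider("computeRP", "compute"),
    Provider("agent_1_RP", "agent", "computeRP"),
    Provider("agent_2_RP", "agent", "computeRP"),
    Provider("device_1_RP", "device", "computeRP"),
    Provider("device_2_RP", "device", "computeRP"),
]


def wrongly_parented(tree: List[Provider]) -> List[str]:
    """Return the names of device RPs whose parent is not an agent RP."""
    by_name = {p.name: p for p in tree}
    result = []
    for p in tree:
        parent = by_name.get(p.parent)
        if p.kind == "device" and (parent is None or parent.kind != "agent"):
            result.append(p.name)
    return result


if __name__ == "__main__":
    print(wrongly_parented(INTENDED))
    print(wrongly_parented(FLATTENED))
```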

Fortunately the direct effect of this mistake is limited to:
* repeated ERROR logs visible in the neutron-server log
* neutron retrying the placement sync at every agent heartbeat, causing unnecessary load on Placement.

The QoS guaranteed minimum bandwidth feature works properly with both tree structures. Neutron can create new device RPs and can update resource inventory on existing device RPs. Nova and Placement can use both types of tree to schedule VMs with ports having QoS policies, and the resource accounting will be correct in Placement.

Proposed solution
=================

The fix is twofold. First we need to restore the proper tree creation logic in neutron. Then we need to provide a way to fix the parent of the deviceRPs created after [2] was applied.

Restore the proper tree creation logic
--------------------------------------

It will be a simple fix that makes sure that neutron tries to create deviceRPs under the agentRP. As a result, the repeated ERROR log will disappear in deployments that were upgraded from before [2]. However, the same ERROR log will then appear in deployments that configured new devices after [2] was applied. The fix will enhance the log message to explain the problem.

This fix will be backported to all the affected branches back to stable/stein.

Fix the wrongly parented RPs
----------------------------

To re-parent the wrongly parented RPs we need to change Placement to allow re-parenting via the PUT /resource_providers/{provider_uuid} API.

Or alternatively we need to provide a script to the cloud admins that is capable of fixing the Placement DB via SQL commands.

[1] https://docs.openstack.org/neutron/latest/admin/config-qos-min-bw.html
[2] https://review.opendev.org/q/I9b08a3a9c20b702b745b41d4885fb5120fd665ce

tags: added: qos
description: updated
summary: - Repeated ERROR log: Unable to save resource provider ... because: re-
- parenting a provider is not currently allowed
+ [QoS min bw] repeated ERROR log: Unable to save resource provider ...
+ because: re-parenting a provider is not currently allowed
Balazs Gibizer (balazs-gibizer) wrote :

This SQL will print all the wrongly parented device RPs.

SELECT *
FROM placement.resource_providers
WHERE
  (name LIKE '%:NIC Switch agent:%' OR
   name LIKE '%:Open vSwitch agent:%') AND
  parent_provider_id=root_provider_id

I don't have enough SQL foo to formulate an UPDATE statement that fixes them. But if somebody can do that then it would be nice to provide that SQL for admins on stable branches having wrongly parented RPs and wanting to fix the tree structure and get rid of the repeated logs and placement load.
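
For anyone who wants to try the SELECT above without touching a real Placement database, here is a minimal sketch that runs it against an in-memory SQLite table. The table is a simplified stand-in with only the columns the query needs, and the provider names are made-up examples; the real Placement schema has more columns and normally lives in MySQL or PostgreSQL.

```python
import sqlite3

# The query from the comment above, unchanged.
WRONGLY_PARENTED_SQL = """
SELECT *
FROM placement.resource_providers
WHERE
  (name LIKE '%:NIC Switch agent:%' OR
   name LIKE '%:Open vSwitch agent:%') AND
  parent_provider_id=root_provider_id
"""


def make_toy_db() -> sqlite3.Connection:
    """Build a toy database with one compute host and a few RPs."""
    conn = sqlite3.connect(":memory:")
    # Attach a second in-memory database so that the "placement." prefix works.
    conn.execute("ATTACH DATABASE ':memory:' AS placement")
    conn.execute(
        "CREATE TABLE placement.resource_providers ("
        " id INTEGER PRIMARY KEY, name TEXT,"
        " parent_provider_id INTEGER, root_provider_id INTEGER)"
    )
    rows = [
        (1, "ubuntu", None, 1),                          # compute (root) RP
        (2, "ubuntu:Open vSwitch agent", 1, 1),          # agent RP
        (3, "ubuntu:Open vSwitch agent:br-ok", 2, 1),    # device under agent
        (4, "ubuntu:Open vSwitch agent:br-bad", 1, 1),   # device under root
        (5, "ubuntu:NIC Switch agent", 1, 1),            # agent RP
        (6, "ubuntu:NIC Switch agent:ens5", 1, 1),       # device under root
    ]
    conn.executemany(
        "INSERT INTO placement.resource_providers VALUES (?, ?, ?, ?)", rows
    )
    return conn


def wrongly_parented_names(conn: sqlite3.Connection) -> list:
    """Return the sorted names of the rows matched by the query."""
    return sorted(row[1] for row in conn.execute(WRONGLY_PARENTED_SQL))


if __name__ == "__main__":
    print(wrongly_parented_names(make_toy_db()))
```

Note that the agent RPs themselves are not matched, because the LIKE patterns require a colon after the agent type.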

Bence Romsics (bence-romsics) wrote :
Changed in neutron:
assignee: nobody → Bence Romsics (bence-romsics)
status: New → Triaged
importance: Undecided → High
Balazs Gibizer (balazs-gibizer) wrote :

OK, I think I managed to create an SQL script that re-parents the deviceRPs to be under the agentRP. Admins can use this script to clean up _after_ the fix for bug 1921150 is applied to neutron.
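
The script itself is not reproduced on this page. Purely as an illustration of the idea, and not as the script mentioned above, here is a sketch of such a re-parenting on a toy SQLite table. It assumes that a device RP is named "<agent RP name>:<device>", so that the agent RP name can be derived by cutting off the last colon-separated part; the table layout and the function names are likewise assumptions. Anything along these lines should only be run against a real Placement database after taking a backup.

```python
import sqlite3


def make_toy_db() -> sqlite3.Connection:
    """Build a simplified stand-in for the resource_providers table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE resource_providers ("
        " id INTEGER PRIMARY KEY, name TEXT,"
        " parent_provider_id INTEGER, root_provider_id INTEGER)"
    )
    conn.executemany(
        "INSERT INTO resource_providers VALUES (?, ?, ?, ?)",
        [
            (1, "ubuntu", None, 1),
            (2, "ubuntu:Open vSwitch agent", 1, 1),
            (3, "ubuntu:Open vSwitch agent:br-ok", 2, 1),
            (4, "ubuntu:Open vSwitch agent:br-bad", 1, 1),
            (5, "ubuntu:NIC Switch agent", 1, 1),
            (6, "ubuntu:NIC Switch agent:ens5", 1, 1),
        ],
    )
    return conn


def reparent_devices(conn: sqlite3.Connection) -> int:
    """Move device RPs that hang off the root RP under their agent RP.

    Returns the number of device RPs that were re-parented.
    """
    devices = conn.execute(
        "SELECT id, name, root_provider_id FROM resource_providers "
        "WHERE (name LIKE '%:NIC Switch agent:%' OR "
        "       name LIKE '%:Open vSwitch agent:%') "
        "AND parent_provider_id = root_provider_id"
    ).fetchall()
    moved = 0
    for dev_id, dev_name, root_id in devices:
        # ASSUMPTION: the agent RP name is the device RP name without
        # its last ":<device>" part.
        agent_name = dev_name.rsplit(":", 1)[0]
        agent = conn.execute(
            "SELECT id FROM resource_providers "
            "WHERE name = ? AND root_provider_id = ?",
            (agent_name, root_id),
        ).fetchone()
        if agent is None:
            continue  # no matching agent RP in this tree, leave the row alone
        conn.execute(
            "UPDATE resource_providers SET parent_provider_id = ? WHERE id = ?",
            (agent[0], dev_id),
        )
        moved += 1
    conn.commit()
    return moved


if __name__ == "__main__":
    db = make_toy_db()
    print(reparent_devices(db))
```

Doing the lookup row by row in application code keeps each SQL statement trivial and avoids an UPDATE that reads from the table it modifies, which some databases restrict.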

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/782553
Committed: https://opendev.org/openstack/neutron/commit/7f35e4e857f7c6e83c635125ce9b42df6e10a510
Submitter: "Zuul (22348)"
Branch: master

commit 7f35e4e857f7c6e83c635125ce9b42df6e10a510
Author: Bence Romsics <email address hidden>
Date: Tue Mar 23 14:07:36 2021 +0100

    Physical NIC RP should be child of agent RP

    In the fix for #1853840 I made a mistake and since then we created
    the physical NIC resource providers as a child of the hypervisor
    resource provider instead of the agent resource provider. Here:

    https://review.opendev.org/c/openstack/neutron/+/696600/3/neutron/agent/common/placement_report.py#159

    This *did not* break the minimum bandwidth aware scheduling.
    But still there are multiple problems:

    1) If you created your physical NIC RPs before the fix for #1853840
       but upgraded to after the fix for #1853840, then resource syncs
       will throw an error in neutron-server at each physical NIC RP
       update. That pollutes the logs and wastes some resources since
       the prohibited update will be forever retried.

    2) If you created your physical NIC RPs after the fix for #1853840
       then your physical NIC RPs have the wrong parent. Which again
       does not break minimum bandwidth aware scheduling. But it may pose
       problems for later features wanting to build on the originally
       planned RP tree structure.

    3) Cleanup of decommissioned RPs is a bit different than expected.
       This cleanup was always left to the admin, so it only affects a
       manual process.

    The proper RP structure was and should be the following:

    The hypervisor RP(s) must be the root(s).
    As a child of each hypervisor RP, there should be an agent RP.
    The physical NIC RPs should be the children of the agent RPs.

    Unfortunately at the moment the Placement API generically prohibits
    update of the parent resource provider id in a PUT request:

    https://docs.openstack.org/api-ref/placement/?expanded=update-resource-provider-detail#update-resource-provider

    Therefore without a later Placement change we cannot fix the RPs
    already created with the wrong parent. However we can fix the RPs
    to be created later. We do that here. We also fix a bug in the unit
    tests that allowed the wrong parent to pass unnoticed. Plus we
    add an extra log message to direct the user seeing the pollution
    in the logs to the proper bug report.

    There may be a follow up patch later, because not all RP re-parenting
    operations are problematic, therefore we are thinking of relaxing
    this blanket prohibition in Placement. When Placement allows updates
    to the parent id we can fix RPs already created with the wrong parent
    too.

    Change-Id: I7caa8827d22103600ca685a58294640fc831dbd9
    Closes-Bug: #1921150
    Co-Authored-By: "Balazs Gibizer" <email address hidden>
    Related-Bug: #1853840

Changed in neutron:
status: Triaged → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/wallaby)

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/neutron/+/789674

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/789674
Committed: https://opendev.org/openstack/neutron/commit/d3be39433cb43bcaceb36a04d2accd6ff9a3aa8b
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit d3be39433cb43bcaceb36a04d2accd6ff9a3aa8b
Author: Bence Romsics <email address hidden>
Date: Tue Mar 23 14:07:36 2021 +0100

    Physical NIC RP should be child of agent RP

    (cherry picked from commit 7f35e4e857f7c6e83c635125ce9b42df6e10a510)

tags: added: in-stable-wallaby
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/victoria)

Fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/neutron/+/790270

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/victoria)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/790270
Committed: https://opendev.org/openstack/neutron/commit/11904b20ad6ce17904f2a685438d7985e32e2cd7
Submitter: "Zuul (22348)"
Branch: stable/victoria

commit 11904b20ad6ce17904f2a685438d7985e32e2cd7
Author: Bence Romsics <email address hidden>
Date: Tue Mar 23 14:07:36 2021 +0100

    Physical NIC RP should be child of agent RP

    (cherry picked from commit 7f35e4e857f7c6e83c635125ce9b42df6e10a510)
    (cherry picked from commit d3be39433cb43bcaceb36a04d2accd6ff9a...


tags: added: in-stable-victoria
tags: added: neutron-proactive-backport-potential
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 17.2.0

This issue was fixed in the openstack/neutron 17.2.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 18.1.0

This issue was fixed in the openstack/neutron 18.1.0 release.

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron-lib (master)

Reviewed: https://review.opendev.org/c/openstack/neutron-lib/+/785337
Committed: https://opendev.org/openstack/neutron-lib/commit/270184e936352c07c8325d88584a6d25d0a4c8cc
Submitter: "Zuul (22348)"
Branch: master

commit 270184e936352c07c8325d88584a6d25d0a4c8cc
Author: Bence Romsics <email address hidden>
Date: Wed Apr 7 13:35:18 2021 +0200

    Use placement version allowing re-parenting RP update

    That is microversion 1.37.

    The next time a placement re-sync is triggered (for example by
    restarting the respective agents) this corrects the parents
    of wrongly created resource providers introduced by bug #1921150.

    Change-Id: I6b54aa9c21bf28de1d451c195e37efde6110258a
    Depends-On: https://review.opendev.org/c/openstack/placement/+/784020
    Related-Bug: #1921150
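
As a rough sketch of what a re-parenting update looks like at the API level, the following builds (but does not send) the PUT /resource_providers/{provider_uuid} request with the microversion named in the commit above. The function name, the RP name and the parent UUID are made-up values for illustration; this is not how neutron-lib constructs the call internally.

```python
import json


def build_reparent_request(rp_uuid, name, new_parent_uuid, microversion="1.37"):
    """Describe the PUT request that moves a provider under a new parent."""
    return {
        "method": "PUT",
        "url": "/resource_providers/%s" % rp_uuid,
        "headers": {
            # Placement selects the microversion from this header.
            "OpenStack-API-Version": "placement %s" % microversion,
            "Content-Type": "application/json",
        },
        "body": json.dumps(
            {"name": name, "parent_provider_uuid": new_parent_uuid}
        ),
    }


if __name__ == "__main__":
    request = build_reparent_request(
        # Device RP UUID taken from the log in the bug description.
        "af0bc0aa-525e-563f-bb4d-2f26f70371d6",
        # Hypothetical device RP name and agent RP UUID.
        "ubuntu:Open vSwitch agent:br-test",
        "11111111-2222-3333-4444-555555555555",
    )
    print(json.dumps(request, indent=2))
```

With an older microversion the same request body is what triggers the "re-parenting a provider is not currently allowed" error shown in the bug description.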

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 19.0.0.0rc1

This issue was fixed in the openstack/neutron 19.0.0.0rc1 release candidate.
