[QoS min bw] repeated ERROR log: Unable to save resource provider ... because: re-parenting a provider is not currently allowed
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
neutron | Fix Released | High | Bence Romsics | |
Bug Description
Description
===========
If neutron is configured with QoS guaranteed minimum bandwidth and the deployment is upgraded from Stein 14.0.4 or older, or from Train 15.0.1 or older, to any newer OpenStack version, the following stack trace appears repeatedly in the neutron-server log:
Mar 24 12:12:36 ubuntu neutron-
Steps to reproduce
==================
1) Deploy neutron Stein 14.0.4 or older or Train 15.0.1 or older
2) Configure minimum guaranteed bandwidth according to [1], e.g. define bandwidth inventory in the ovs or sriov agent config with the [ovs] or [sriov_
3) Observe that the agent RP and the device RPs are created
$ openstack --os-placement-
and that the parent of the device RP is the agent RP.
4) Upgrade to a newer OpenStack version
5) Observe that the above periodic error appears in the neutron-server log.
Expected result
===============
No error logs
Actual result
=============
Repeated error logs appear
Triage
======
The problem is caused by the bugfix [2] merged in Ussuri and backported to stable/train and stable/stein.
Before patch [2] the RP tree in placement was created with the following structure:
computeRP
 \- agent_1_RP
 |   \- device_1_RP
 |   \- device_2_RP
 |
 \- agent_2_RP
That is, the parent of the deviceRP is the agentRP that has the given device configured.
However, after patch [2] neutron tries to create a tree where the parent of the deviceRP is the computeRP:
computeRP
 \- agent_1_RP
 \- agent_2_RP
 \- device_1_RP
 \- device_2_RP
If the deviceRP already exists under the agentRP before the upgrade, then after the upgrade neutron tries to update the parent of the deviceRP to point to the computeRP. However, the placement API does not allow such re-parenting of an RP. Hence the periodic ERROR message in the neutron-server log.
If a new device is added to the ovs or sriov agent config after the upgrade, then the neutron-server successfully creates the deviceRP under the computeRP and therefore no repeated ERROR log appears.
Changing the structure of the RP tree was a mistake in [2]. The correct and intended structure is where the deviceRP is under the agentRP.
Fortunately the direct effect of this mistake is limited to:
* repeated ERROR log visible in the neutron server
* neutron retries the placement sync at every agent heartbeat, causing unnecessary load on Placement.
The QoS guaranteed minimum bandwidth feature works properly with both tree structures. Neutron can create new device RPs and can update the resource inventory on existing device RPs. Nova and Placement can use both types of tree to schedule VMs with ports having QoS policies, and the resource accounting will be correct in Placement.
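The two layouts described in the triage can be modelled with a few lines of Python. This is only an illustrative sketch, not neutron code: the `RP` class, the `wrongly_parented` helper, the host name `host1` and the device name `br-phy0` are all hypothetical, and the sketch assumes that a device RP name has the form `<host>:<agent type>:<device>` while an agent RP name has the form `<host>:<agent type>`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RP:
    """Hypothetical, minimal stand-in for a placement resource provider."""
    name: str
    parent: Optional[str]  # name of the parent RP, None for the root


def wrongly_parented(providers: List[RP], compute_name: str) -> List[str]:
    """Return names of device RPs that hang directly under the compute RP."""
    return [
        rp.name
        for rp in providers
        # Assumed naming convention: a device RP name has three
        # colon-separated parts, an agent RP name has only two.
        if rp.name.count(":") == 2 and rp.parent == compute_name
    ]


# Layout before patch [2]: the device RP is a child of its agent RP.
before = [
    RP("host1", None),
    RP("host1:Open vSwitch agent", "host1"),
    RP("host1:Open vSwitch agent:br-phy0", "host1:Open vSwitch agent"),
]

# Layout created after patch [2]: the device RP is a child of the compute RP.
after = [
    RP("host1", None),
    RP("host1:Open vSwitch agent", "host1"),
    RP("host1:Open vSwitch agent:br-phy0", "host1"),
]

print(wrongly_parented(before, "host1"))  # []
print(wrongly_parented(after, "host1"))   # ['host1:Open vSwitch agent:br-phy0']
```

Only the second layout yields a hit, which matches the description above: the mismatch between the two layouts is what triggers the rejected re-parenting attempt.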
Proposed solution
=================
The fix is twofold. First, we need to restore the proper tree creation logic in neutron. Second, we need to provide a way to fix the parent of the deviceRPs created after [2] was applied.
Restore the proper tree creation logic
-------
This will be a simple fix that makes sure that neutron tries to create deviceRPs under the agentRP. As a result, the repeated ERROR log will disappear in deployments that were upgraded from a version before [2]. However, the same ERROR log will then appear in deployments that configured new devices after [2] was applied. The fix will enhance the log message to explain the problem.
This fix will be backported to all the affected branches, down to stable/stein.
Fix the wrongly parented RPs
-------
To re-parent the wrongly parented RPs we need to change Placement to allow re-parenting via the PUT /resource_
Alternatively, we need to provide cloud admins with a script that is capable of fixing the Placement DB via SQL commands.
[1] https:/
[2] https:/
This SQL will print all the wrongly parented device RPs.
SELECT *
FROM placement.resource_providers
WHERE
(name LIKE '%:NIC Switch agent:%' OR
name LIKE '%:Open vSwitch agent:%') AND
parent_provider_id=root_provider_id
I don't have enough SQL-fu to formulate an UPDATE statement that fixes them. If somebody can do that, it would be nice to provide that SQL to admins on stable branches who have wrongly parented RPs and want to fix the tree structure and get rid of the repeated logs and the extra placement load.
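One possible shape for such a fix is sketched below in Python against an in-memory SQLite table. This is a sketch under stated assumptions, not a tested admin script. The table is a simplified stand-in for `placement.resource_providers` that keeps only the columns used by the SELECT above plus `id`. The helper names `find_wrongly_parented` and `reparent_to_agent` are hypothetical. The sketch also assumes that the agent RP name is the device RP name with the last `:<device>` part removed. A real Placement database would need a backup and a review of the schema before anything like this is run.

```python
import sqlite3


def find_wrongly_parented(conn):
    """Device RPs whose parent is the root (compute) RP, as in the SELECT above."""
    return conn.execute(
        """
        SELECT id, name, root_provider_id
        FROM resource_providers
        WHERE (name LIKE '%:NIC Switch agent:%' OR
               name LIKE '%:Open vSwitch agent:%') AND
              parent_provider_id = root_provider_id
        """
    ).fetchall()


def reparent_to_agent(conn):
    """Point each wrongly parented device RP at its agent RP; return the count."""
    fixed = 0
    for rp_id, name, root_id in find_wrongly_parented(conn):
        # Assumption: agent RP name == device RP name minus the ':<device>' suffix.
        agent_name = name.rsplit(":", 1)[0]
        agent = conn.execute(
            "SELECT id FROM resource_providers "
            "WHERE name = ? AND root_provider_id = ?",
            (agent_name, root_id),
        ).fetchone()
        if agent is None:
            continue  # no matching agent RP in the same tree, leave untouched
        conn.execute(
            "UPDATE resource_providers SET parent_provider_id = ? WHERE id = ?",
            (agent[0], rp_id),
        )
        fixed += 1
    conn.commit()
    return fixed


# Demo data: simplified table with made-up host and device names.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE resource_providers ("
    "id INTEGER PRIMARY KEY, name TEXT, "
    "root_provider_id INTEGER, parent_provider_id INTEGER)"
)
conn.executemany(
    "INSERT INTO resource_providers VALUES (?, ?, ?, ?)",
    [
        (1, "host1", 1, None),                         # compute RP (root)
        (2, "host1:Open vSwitch agent", 1, 1),         # agent RP
        (3, "host1:Open vSwitch agent:br-phy0", 1, 1), # wrongly parented device RP
        (4, "host1:NIC Switch agent", 1, 1),           # agent RP
        (5, "host1:NIC Switch agent:ens5", 1, 4),      # correctly parented device RP
    ],
)

fixed = reparent_to_agent(conn)
print(fixed)                        # 1
print(find_wrongly_parented(conn))  # []
```

Doing the lookup row by row in Python rather than in a single UPDATE keeps the logic independent of database-specific UPDATE syntax, at the cost of one extra query per device RP.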