Create subnet is failing under high load with OVN

Bug #1890432 reported by Márton Kiss
18
This bug affects 2 people
Affects Status Importance Assigned to Milestone
neutron
High
Rodolfo Alonso
neutron (Ubuntu)
High
Unassigned
Focal
High
Unassigned
Groovy
High
Unassigned

Bug Description

Under a high concurrency level create subnet is starting to fail. (12-14% failure rate) The bundle is OVN / Ussuri.

neutronclient.common.exceptions.Conflict: Unable to complete operation on subnet This subnet is being modified by another concurrent operation.

Stacktrace: https://pastebin.ubuntu.com/p/sQ5CqD6NyS/
Rally task:

{% set flavor_name = flavor_name or "m1.medium" %}
{% set image_name = image_name or "bionic-kvm" %}

---
  NeutronNetworks.create_and_delete_subnets:
    -
      args:
        network_create_args: {}
        subnet_create_args: {}
        subnet_cidr_start: "1.1.0.0/30"
        subnets_per_network: 2
      runner:
        type: "constant"
        times: 100
        concurrency: 10
      context:
        network: {}
        users:
          tenants: 30
          users_per_tenant: 1
        quotas:
          neutron:
            network: -1
            subnet: -1

Concurrency level set to 1 instead of 10 is not triggering the issue.

Revision history for this message
Márton Kiss (marton-kiss) wrote :
Revision history for this message
Pedro Guimarães (pguimaraes) wrote :

We've done some evaluation on neutron logs, neutron-server-2.log is the most relevant.

We can see the following stacktrace: https://pastebin.canonical.com/p/SMKvPbnnmZ/

The original request for subnet creation is: req-32f521c9-9634-4483-9e50-944033b515ee
Which relates to subnet: 432de530-47b4-4587-b623-ed5b4cbe0c39

However, the stacktrace cites another subnet: 40868773-ded4-4047-9c2e-a0a41cfdb143

Which has been deleted just before the stacktrace:
2020-08-04 14:23:43.121 1528233 DEBUG neutron.db.db_base_plugin_v2 [req-b5c62ce8-2c3b-4893-a103-6cdb8662f5d3 848f111e240b46d682e99836147297b6 2b99334dcd4c487ca49edeb5149c2699 - 7a9fab9568114905b8c864caaa6468e1 7a9fab9568114905b8c864caaa6468e1] Deleting subnet 40868773-ded4-4047-9c2e-a0a41cfdb143 delete_subnet /usr/lib/python3/dist-packages/neutron/db/db_base_plugin_v2.py:1042

Here is the stack with the delete just before:
https://pastebin.canonical.com/p/yDQsxb8ktf/

Looking at the driver code, I can see: https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_client.py#L2096

Which, at one subnet creation, grabs all the subnets for the same network and checks one by one for metadata port. However, if one subnet has been set for deletion just before, that can cause errors from concurrency.

The commit for that line is pretty old, it dates from original repo: openstack/networking-ovn.
Here: https://github.com/openstack/networking-ovn/commit/ad1fea3e7ba837d521c924b285294d7cf2efcefb

Would that be possible to tweak "update_metadata_port" method to only look for metadata port for the subnet that has been created? Otherwise, we need to assume that not all the other subnets are available for edit at the moment of the request.

In my view, editing all subnets at once only makes sense if we had a config change to enable DHCP or metadata services and we need to set it for them all. That can be done at neutron-server restart instead.

Revision history for this message
Pedro Guimarães (pguimaraes) wrote :

Rally will probably always trigger that concurrency at create_delete subnet scenario.

Revision history for this message
Frode Nordahl (fnordahl) wrote :

Thank you for the bug and log artifacts.

FWIW, the create_and_delete_subnets rally task is a part of the upstream Neutron gate [0] and it is passing there [1]. However I am not sure how the gate deployment is laid out, and it may very well be it differs both in scale and topology compared to how we set things up.

I'll take a look.

0: https://github.com/openstack/neutron/blob/master/rally-jobs/task-neutron.yaml
1: https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_8f7/744407/1/check/neutron-ovn-rally-task/8f7e88a/

Changed in charm-neutron-api:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Frode Nordahl (fnordahl)
Revision history for this message
Frode Nordahl (fnordahl) wrote :

Adding upstream Neutron project to this LP.

The lock contention arises from the update of the metadata_port:
https://github.com/openstack/neutron/blob/24590a334fff0ed1cb513b0f496be965bc9309d4/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_client.py#L2111
https://github.com/openstack/neutron/blob/24590a334fff0ed1cb513b0f496be965bc9309d4/neutron/db/ipam_backend_mixin.py#L653-L680

Updating the fixed_ips field of the metadata_port will make Neutron attempt to lock all the subnets involved regardless of update_metadata_port only changing one subnet or all of them.

I wonder how the OVS driver dealt with this, as it would have the exact same issue.

Perhaps the only option is for Neutron to gracefully ignore a update_metadata_port failure at subnet creation and update the metadata port at a later time in one of its maintenance jobs.

Changed in charm-neutron-api:
status: Triaged → Invalid
importance: High → Undecided
assignee: Frode Nordahl (fnordahl) → nobody
Changed in neutron:
status: New → In Progress
assignee: nobody → Frode Nordahl (fnordahl)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/745082

Revision history for this message
Márton Kiss (marton-kiss) wrote :

After applying this patch for neutron-server, I can still see the same errors in Rally (neutronclient.common.exceptions.Conflict: Unable to complete operation on subnet This subnet is being modified by another concurrent operation.).

The additional metadata port log entries are showing up:
$ grep "Unable to update" /var/log/neutron/neutron-server.log
2020-08-06 16:18:01.184 1861427 WARNING neutron.plugins.ml2.drivers.ovn.mech_driver.ovsdb.ovn_client [req-14b20508-e8af-4cbe-ac26-01b7db79185d c11bb69491d64f77acbd0fa5261dded0 09864d4c1ece4ace967b58aea383bf13 - 7a9fab9568114905b8c864caaa6468e1 7a9fab9568114905b8c864caaa6468e1] Unable to update metadata port on network b5cdfe05-f6dc-408d-a738-44ddf2438af0 for subnet cebbf10a-4ecb-4754-833b-6911c99f1548. The update will be performed in a maintenance job.: neutron_lib.exceptions.SubnetNotFound: Subnet e664178c-3875-4db8-95a5-91aa5637d894 could not be found.
2020-08-06 16:18:09.887 1861432 WARNING neutron.plugins.ml2.drivers.ovn.mech_driver.ovsdb.ovn_client [req-c3e7193a-8626-43e7-903b-214a250ed878 a890ac112c874f5ba9062aee95586e3a bb6d36aabfe24e9399a5faae5ed02743 - 7a9fab9568114905b8c864caaa6468e1 7a9fab9568114905b8c864caaa6468e1] Unable to update metadata port on network 469f44ca-ef25-4995-8a41-b85568cbd5af for subnet 32535c97-26ec-470e-b533-ba6d53f9ef04. The update will be performed in a maintenance job.: neutron_lib.exceptions.SubnetNotFound: Subnet 8b3724e1-77e0-4993-a293-94b2b9db8cec could not be found.

The relevant part of rally output: https://pastebin.ubuntu.com/p/czdgj994X7/

tags: added: ovn
Changed in neutron:
importance: Undecided → High
Revision history for this message
Frode Nordahl (fnordahl) wrote :

I forgot to catch both SubnetNotFound and SubnetInUse in the first patch set which of course would make the problem still occur. I have updated the patch to catch both, would you be able to try again and see if it makes a difference?

Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello:

I don't have access to any trace. What I would like to know is what operation is failing: the subnet deletion or the subnet creation/update.

Regards.

Revision history for this message
Frode Nordahl (fnordahl) wrote :
Download full text (8.8 KiB)

Hello Rodolfo, the error occurs on create subnet, I will include an excerpt here and I have asked the reporter to publish what he can:

2020-08-04 14:23:47.914 1220597 ERROR neutron.plugins.ml2.managers [req-9dd64374-b23d-4296-b0f4-98b3cab64490 8a40f8aceae640a2ae10510992073b05 645ffd60a1054fab8359e3afe7244572 - 7a9fab9568114905b8c864caaa6468e1 7a9fab9568114905b8c864caaa6468e1] Mechanism driver 'ovn' failed in create_subnet_postcommit: neutron_lib.exceptions.SubnetInUse: Unable to complete operation on subnet 34fd86b6-5030-4e58-b559-964088afdc1d: This subnet is being modified by another concurrent operation.
2020-08-04 14:23:47.914 1220597 ERROR neutron.plugins.ml2.managers Traceback (most recent call last):
2020-08-04 14:23:47.914 1220597 ERROR neutron.plugins.ml2.managers File "/usr/lib/python3/dist-packages/neutron/plugins/ml2/managers.py", line 477, in _call_on_drivers
2020-08-04 14:23:47.914 1220597 ERROR neutron.plugins.ml2.managers getattr(driver.obj, method_name)(context)
2020-08-04 14:23:47.914 1220597 ERROR neutron.plugins.ml2.managers File "/usr/lib/python3/dist-packages/neutron/plugins/ml2/drivers/ovn/mech_driver/mech_driver.py", line 428, in create_subnet_postcommit
2020-08-04 14:23:47.914 1220597 ERROR neutron.plugins.ml2.managers context.network.current)
2020-08-04 14:23:47.914 1220597 ERROR neutron.plugins.ml2.managers File "/usr/lib/python3/dist-packages/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_client.py", line 2017, in create_subnet
2020-08-04 14:23:47.914 1220597 ERROR neutron.plugins.ml2.managers self.update_metadata_port(context, network['id'])
2020-08-04 14:23:47.914 1220597 ERROR neutron.plugins.ml2.managers File "/usr/lib/python3/dist-packages/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_client.py", line 2206, in update_metadata_port
2020-08-04 14:23:47.914 1220597 ERROR neutron.plugins.ml2.managers metadata_port['id'], port)
2020-08-04 14:23:47.914 1220597 ERROR neutron.plugins.ml2.managers File "/usr/lib/python3/dist-packages/neutron/common/utils.py", line 687, in inner
2020-08-04 14:23:47.914 1220597 ERROR neutron.plugins.ml2.managers return f(self, context, *args, **kwargs)
2020-08-04 14:23:47.914 1220597 ERROR neutron.plugins.ml2.managers File "/usr/lib/python3/dist-packages/neutron_lib/db/api.py", line 233, in wrapped
2020-08-04 14:23:47.914 1220597 ERROR neutron.plugins.ml2.managers return method(*args, **kwargs)
2020-08-04 14:23:47.914 1220597 ERROR neutron.plugins.ml2.managers File "/usr/lib/python3/dist-packages/neutron_lib/db/api.py", line 139, in wrapped
2020-08-04 14:23:47.914 1220597 ERROR neutron.plugins.ml2.managers setattr(e, '_RETRY_EXCEEDED', True)
2020-08-04 14:23:47.914 1220597 ERROR neutron.plugins.ml2.managers File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
2020-08-04 14:23:47.914 1220597 ERROR neutron.plugins.ml2.managers self.force_reraise()
2020-08-04 14:23:47.914 1220597 ERROR neutron.plugins.ml2.managers File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
2020-08-04 14:23:47.914 1220597 ERROR neutron.plugins.ml2.managers six.reraise(self.type_, s...

Read more...

Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Ok, I talked to one of the OVN mantainers and he confirmed this point: when "update_metadata_port" is called from update/create subnet, there is no need to update the metadata port in all subnets. That will be done in each subnet call.

Only when this method is called from the DB sync process, can update the metadata port globally.

I'll push a patch today.

Regards.

Revision history for this message
Márton Kiss (marton-kiss) wrote :

I did two test-runs with the patch#2, and no more conflict errors are showing up, the rally is finishing with a 100% result:

+-------------------------------------------------------------------------------------------------------------------------------+
| Response Times (sec) |
+----------------------------+-----------+--------------+--------------+--------------+-----------+-----------+---------+-------+
| Action | Min (sec) | Median (sec) | 90%ile (sec) | 95%ile (sec) | Max (sec) | Avg (sec) | Success | Count |
+----------------------------+-----------+--------------+--------------+--------------+-----------+-----------+---------+-------+
| neutron.create_subnet (x2) | 4.274 | 7.567 | 14.682 | 18.878 | 21.743 | 9.178 | 100.0% | 100 |
| neutron.delete_subnet (x2) | 1.017 | 1.768 | 3.602 | 4.301 | 7.101 | 2.238 | 100.0% | 100 |
| total | 6.058 | 9.218 | 18.286 | 22.765 | 26.168 | 11.417 | 100.0% | 100 |
| -> duration | 6.058 | 9.218 | 18.286 | 22.765 | 26.168 | 11.417 | 100.0% | 100 |
| -> idle_duration | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 100.0% | 100 |
+----------------------------+-----------+--------------+--------------+--------------+-----------+-----------+---------+-------+

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/745330

Changed in neutron:
assignee: Frode Nordahl (fnordahl) → Rodolfo Alonso (rodolfo-alonso-hernandez)
Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello:

Could you try this patch ^^^?

Regards.

Revision history for this message
Márton Kiss (marton-kiss) wrote :

Rodolfo, you can find the log here: https://drive.google.com/file/d/1rVD4PdUcLLtL_jCOBM3dR9EOHiG-xE9K/view?usp=sharing

neutronclient.common.exceptions.Conflict: Unable to complete operation on subnet f0f776ca-f6e7-4007-b592-fcf0de9fb933: This subnet is being modified by another concurrent operation.
Neutron server returns request_ids: ['req-edc1b690-62b1-4741-a8ed-29e99646f7b2']

+-------------------------------------------------------------------------------------------------------------------------------+
| Response Times (sec) |
+----------------------------+-----------+--------------+--------------+--------------+-----------+-----------+---------+-------+
| Action | Min (sec) | Median (sec) | 90%ile (sec) | 95%ile (sec) | Max (sec) | Avg (sec) | Success | Count |
+----------------------------+-----------+--------------+--------------+--------------+-----------+-----------+---------+-------+
| neutron.create_subnet (x2) | 5.244 | 8.707 | 12.604 | 14.067 | 19.092 | 9.152 | 96.9% | 97 |
| neutron.delete_subnet (x2) | 1.003 | 2.164 | 3.253 | 3.666 | 5.524 | 2.27 | 100.0% | 94 |
| neutron.create_subnet | 2.671 | 3.481 | 4.254 | 4.351 | 4.447 | 3.533 | 0.0% | 3 |
| total | 2.671 | 10.058 | 15.676 | 16.734 | 23.823 | 11.118 | 94.0% | 100 | | -> duration | 2.671 | 10.058 | 15.676 | 16.734 | 23.823 | 11.118 | 94.0% | 100 |
| -> idle_duration | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 94.0% | 100 |
+----------------------------+-----------+--------------+--------------+--------------+-----------+-----------+---------+-------+

Revision history for this message
Márton Kiss (marton-kiss) wrote :
Changed in neutron (Ubuntu):
status: New → Triaged
importance: Undecided → High
Changed in neutron (Ubuntu Focal):
status: New → Triaged
importance: Undecided → High
Changed in neutron:
assignee: Rodolfo Alonso (rodolfo-alonso-hernandez) → Frode Nordahl (fnordahl)
Changed in neutron:
assignee: Frode Nordahl (fnordahl) → Rodolfo Alonso (rodolfo-alonso-hernandez)
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

sub'd to field high, this is affecting sqa release testing.

tags: added: neutron-proactive-backport-potential
James Page (james-page)
Changed in neutron:
status: In Progress → Fix Committed
Changed in neutron (Ubuntu):
status: Triaged → Invalid
status: Invalid → Fix Released
Revision history for this message
James Page (james-page) wrote :

Hirsute/Wallaby packages include the fix from:

https://review.opendev.org/c/openstack/neutron/+/745330/

So marked "Fix Released" for this target.

Focal/Ussuri and Groovy/Wallaby - fix has been merged into the neutron table branch for each release however no new point releases from Neutron for these two release targets yet.

Revision history for this message
James Page (james-page) wrote :
no longer affects: charm-neutron-api
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 16.3.1

This issue was fixed in the openstack/neutron 16.3.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 17.1.1

This issue was fixed in the openstack/neutron 17.1.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 18.0.0.0rc1

This issue was fixed in the openstack/neutron 18.0.0.0rc1 release candidate.

Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

Given that the three point releases mentioned are now included in Ubuntu / the Cloud Archive, I'm marking this fix-released for all affected targets.

Focal-updates: 2:16.3.2
Groovy-updates: 2:17.1.1
Hirsute: 2:18.0.0

Changed in neutron (Ubuntu Focal):
status: Triaged → Fix Released
Changed in neutron (Ubuntu Groovy):
status: Triaged → Fix Released
Revision history for this message
Akihiro Motoki (amotoki) wrote :

The status in neutron was not updated somehow. The fix https://review.opendev.org/c/openstack/neutron/+/745330/ landed during Wallaby development.

Changed in neutron:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers