Neutron API is not available when a specific ovn-central unit is down

Bug #1991239 reported by Przemyslaw Hausman
This bug report is a duplicate of: Bug #1985062: ovsdbapp ssl send socket error.
This bug affects 1 person.

Affects: OpenStack Neutron API Charm (Status: New, Importance: Undecided, Assigned to: Unassigned)

Bug Description

Neutron API is not available when a specific ovn-central unit is down. Running any network-related CLI commands fails as follows:

```
$ openstack network list
HttpException: 503: Server Error for url: https://neutron.orange.box:9696/v2.0/networks, 503 Service Unavailable: problems. Please try again later.: Apache/2.4.41 (Ubuntu) Server at neutron.orange.box Port 9696: Service Unavailable: The server is temporarily unable to service your: request due to maintenance downtime or capacity
```

Affected OpenStack release: Focal-Yoga.
The problem is not reproducible on Focal-Ussuri.

Steps to reproduce:

1. Identify the IP address listed first in the ovn_nb_connection string; this is the ovn-central unit you will shut down in the next step. Important: it must be the first IP address in the list.
```
$ juju ssh neutron-api/0 sudo grep ovn_nb_connection /etc/neutron/plugins/ml2/ml2_conf.ini
ovn_nb_connection = ssl:172.27.81.187:6641,ssl:172.27.81.172:6641,ssl:172.27.81.209:6641
```
In the above example, the first IP is 172.27.81.187.

2. Shut down the ovn-central unit that holds the IP from the previous step.
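For example, assuming the unit holding that IP address is ovn-central/0 (the unit name here is illustrative), one way to take it down is:
```
# Illustrative only: any method of powering off the machine hosting that unit works.
juju ssh ovn-central/0 'sudo poweroff'
```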

3. Restart the neutron-server service on the neutron-api unit:
```
juju ssh neutron-api/0 'sudo systemctl restart neutron-server'
```

4. Access the Neutron API and confirm that it is no longer available. This is not the expected outcome; the expected behavior is for the Neutron API to remain available.

```
$ openstack network list
HttpException: 503: Server Error for url: https://neutron.orange.box:9696/v2.0/networks, 503 Service Unavailable: problems. Please try again later.: Apache/2.4.41 (Ubuntu) Server at neutron.orange.box Port 9696: Service Unavailable: The server is temporarily unable to service your: request due to maintenance downtime or capacity
```

To validate that the problem only occurs when the first of the ovn-central units listed in ml2_conf.ini is down, run the following steps:

1. Move the IP address of the downed ovn-central unit to the end of the list in ml2_conf.ini on the neutron-api unit, for example:

BEFORE:
```
ovn_nb_connection = ssl:172.27.81.209:16642,ssl:172.27.81.187:16642,ssl:172.27.81.172:16642
[...]
ovn_sb_connection = ssl:172.27.81.209:16642,ssl:172.27.81.187:16642,ssl:172.27.81.172:16642
```

AFTER:
```
ovn_nb_connection = ssl:172.27.81.187:16642,ssl:172.27.81.172:16642,ssl:172.27.81.209:16642
[...]
ovn_sb_connection = ssl:172.27.81.187:16642,ssl:172.27.81.172:16642,ssl:172.27.81.209:16642
```

Then restart the neutron-server service on the neutron-api unit. The problem should go away and the Neutron API should become available again, as shown in the example below.
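For example, reusing the commands from steps 3 and 4 above:
```
juju ssh neutron-api/0 'sudo systemctl restart neutron-server'
openstack network list
```
The second command should now return the network list instead of a 503 error.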

Please see the attached neutron-server.log from the neutron-api unit. Look for the message "Unrecoverable error: please check log for details.: ValueError: non-zero flags not allowed in calls to send() on <class 'eventlet.green.ssl.GreenSSLSocket'>".
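For example, the occurrences can be located with a grep similar to the following (the log path shown is the packaging default and may differ per deployment):
```
juju ssh neutron-api/0 "sudo grep -n 'non-zero flags not allowed' /var/log/neutron/neutron-server.log"
```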

Revision history for this message
Przemyslaw Hausman (phausman) wrote :

I'm subscribing field-critical as this problem is encountered on a customer deployment. The customer shuts down a random control node (one of three) and expects the OpenStack API to remain available. The issue is blocking the handover of the cloud.

Revision history for this message
Frode Nordahl (fnordahl) wrote :

This sounds like bug 1985062

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

For context: Neutron should rely on the `ovsdb_probe_interval` setting on the client side (set to 60 seconds by default) for detecting DB failovers:

https://github.com/openstack/neutron/blob/e4cc40f114aed485a62ed8535813d6ee610ce41f/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovsdb_monitor.py#L618-L623
https://github.com/openstack/neutron/blob/e4cc40f114aed485a62ed8535813d6ee610ce41f/neutron/conf/plugins/ml2/drivers/ovn/ovn_conf.py#L90-L97
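For reference, a minimal sketch of what an explicit client-side setting would look like in ml2_conf.ini (the value shown is the documented 60000 ms default; the option may simply be left at its default rather than rendered by the charm):
```
[ovn]
# Client-side probe interval towards the OVN NB/SB databases, in milliseconds.
ovsdb_probe_interval = 60000
```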

We also set the inactivity probe on the ovsdb side to the corresponding value by default:
https://opendev.org/x/charm-ovn-central/commit/9dcd53bb75805ff733c8f10b99724ea16a2b5f25
https://github.com/openvswitch/ovs/blob/5a686267d36c5c4229ec801a9616ceb60740fbe3/vswitchd/vswitch.xml#L5303-L5311
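A quick way to inspect what is actually configured on the database side is to list the Connection table on an ovn-central unit (a sketch; depending on the deployment, --db may need to point at the NB/SB socket explicitly):
```
# The inactivity_probe column in the output is expressed in milliseconds.
juju ssh ovn-central/0 'sudo ovn-nbctl list connection'
```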

There was a recent change in Neutron around the handling of inactivity probes for connection strings that contain multiple addresses, which is not yet in the Yoga UCA:

https://bugs.launchpad.net/neutron/+bug/1958364
https://review.opendev.org/c/openstack/neutron/+/825269
https://github.com/openstack/neutron/commit/789aa7122021e206bf07e377dc25df7a854b45fe

```
$ git --no-pager tag --contains=789aa7122021e206bf07e377dc25df7a854b45fe
21.0.0.0rc1
```

https://openstack-ci-reports.ubuntu.com/reports/cloud-archive/yoga_versions.html lists neutron 2:20.2.0-0ubuntu1 for the Yoga UCA, which predates that tag.
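To check which neutron version a given cloud is actually running (and hence whether it already contains the commit above), something like this can be used (a sketch; package name as in the Ubuntu archive):
```
juju ssh neutron-api/0 'apt-cache policy neutron-common'
```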

Prior art introducing the 60s timeout in the first place suggests that the OVS Python IDL should be better at handling this instead:
https://bugs.launchpad.net/networking-ovn/+bug/1772656/comments/3
https://review.opendev.org/c/openstack/networking-ovn/+/569977/

It is interesting that the OpenStack version difference (Focal-Yoga vs. Focal-Ussuri) has an effect on whether the issue appears, as reported above.

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

Looking at the log, there are indeed `ValueError: non-zero flags not allowed in calls to send()` messages.

So at this point I don't think the issue is a reconnection failure caused by a misconfiguration on the charm side.

Neutron repeatedly tries to initialize the OVN mech driver and fails with that error:

```
2022-09-29 07:52:08.865 38789 ERROR neutron.service ValueError: non-zero flags not allowed in calls to send() on <class 'eventlet.green.ssl.GreenSSLSocket'>
2022-09-29 07:52:08.865 38789 ERROR neutron.service
2022-09-29 07:52:08.890 38789 CRITICAL neutron [req-28756c47-5db1-4094-9335-fa84dd7f2635 - - - - -] Unhandled error: ValueError: non-zero flags not allowed in calls to send() on <class 'eventlet.green.ssl.GreenSSLSocket'>
```

So it is likely that a reconnection triggers this crash, which affects neutron-server as a whole.

From what I can see, there is an open bug about it for openvswitch:

https://bugs.launchpad.net/openvswitch/+bug/1985062

https://github.com/openvswitch/ovs/commit/1731ed43c6dca385ed1f6a7fb25148f0a34fd3b9 - upstream commit mentioned in the bug.

The test case here indeed confirms the failure mode:
https://bugs.launchpad.net/openvswitch/+bug/1985062/comments/3

Revision history for this message
DUFOUR Olivier (odufourc) wrote :

As far as I could test in my lab:
* Focal Ussuri --> works fine
* Focal Wallaby --> works fine
* Focal Xena --> works fine

* Focal Yoga --> broken

The issue seems to be triggered only starting with the Focal-Yoga release.
