Issues with OVN-based Octavia

Bug #2091931 reported by Matt Verran
Affects: OpenStack Snap (Status: Incomplete, Importance: Undecided, Assigned to: Unassigned)
Affects: ovn-octavia-provider (Ubuntu) (Status: New, Importance: Undecided, Assigned to: Unassigned)

Bug Description

Enabling Loadbalancer and CaaS in Sunbeam should ultimately result in a working K8s cluster that can be used as any K8s cluster would be.

Example 1.

When Istio is deployed it creates three load balancers: ingress, egress, and CNI. Each of these is created seemingly without error, but the status soon drops into the 'error' state.

Assuming it's not user error, the outcome depends on which of the following applies:

1) It is possible to work using OVN-based Octavia, and it's just that the cloud-controller/ingress-controller doesn't set it up correctly (for example, the load balancer method doesn't get set to SOURCE_IP_PORT; possibly other issues), so it's a bug there (where?).

2) This setup actually requires Amphora-based Octavia, making this a wishlist feature request (in which case, close and add this reasoning to https://bugs.launchpad.net/snap-openstack/+bug/2044567). I cannot find anything that says it won't work with OVN-based Octavia, but I have not seen any confirmation that it does.

3) It's both cases, in which case twice the fun.

Example 2

Creating Jenkins using the Helm chart from https://charts.jenkins.io with the option below creates a LoadBalancer in Octavia that is similar to the above: the pool is offline and no LB method is set.

  set {
    name = "controller.serviceType"
    value = "LoadBalancer"
  }

Example 3

Following https://github.com/kubernetes/cloud-provider-openstack/blob/master/docs/openstack-cloud-controller-manager/expose-applications-using-loadbalancer-type-service.md results in no response from the curl command.

Revision history for this message
Matt Verran (mv-2112) wrote : Re: [CaaS] magnum + loadbalancer does not work for ingress

The issue would appear to be related to the config documented at https://github.com/kubernetes/cloud-provider-openstack/blob/master/docs/openstack-cloud-controller-manager/using-openstack-cloud-controller-manager.md#load-balancer, which covers the following configuration options for openstack-cloud-controller-manager.

lb-method: The load balancing algorithm used to create the load balancer pool.

If lb-provider is set to "amphora" or "octavia" the value can be one of:

    ROUND_ROBIN (default)
    LEAST_CONNECTIONS
    SOURCE_IP

If lb-provider is set to "ovn" the value must be set to SOURCE_IP_PORT.

lb-provider: Optional. Used to specify the provider of the load balancer, e.g. "amphora" (default), "octavia" (deprecated alias for "amphora"), "ovn" or "f5". Only the "amphora", "octavia", "ovn" and "f5" providers are officially tested; other providers will cause a warning log.
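For reference, a minimal openstack-cloud-controller-manager cloud-config sketch matching the OVN constraint above might look like the fragment below (section and option names are taken from the linked documentation; the values are illustrative, not a confirmed working Sunbeam config):

```ini
[LoadBalancer]
# The OVN provider only supports the SOURCE_IP_PORT algorithm,
# so lb-method must be set explicitly when lb-provider=ovn.
lb-provider=ovn
lb-method=SOURCE_IP_PORT
```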

summary: - [CaaS] magnum + loadbalancer does not give a working setup for istio
+ [CaaS] magnum + loadbalancer does not work for ingress
description: updated
Matt Verran (mv-2112)
description: updated
Revision history for this message
Matt Verran (mv-2112) wrote :

Interestingly, the Octavia logs say the opposite... something is broken here.

2024-12-20T10:36:44.804Z [wsgi-octavia-api] 2024-12-20 10:36:44.804000 octavia.common.exceptions.ProviderUnsupportedOptionError: Provider 'ovn' does not support a requested option: OVN provider does not support SOURCE_IP algorithm

Revision history for this message
Matt Verran (mv-2112) wrote :

The same issues occur with a known-good Terraformed WebSphere Liberty example: the pool and members are in the error state. SOURCE_IP was specified but, on viewing the config, it has not been applied.

Revision history for this message
Matt Verran (mv-2112) wrote :

2024.1/beta (637)

Revision history for this message
Matt Verran (mv-2112) wrote (last edit ):

Further investigation...

1) The microstack.run instructions and Terraform don't produce an error, and both specify SOURCE_IP_PORT. The pool still fails to configure itself correctly: the pool LB method is blank and cannot be corrected.

2) An LB created manually via Horizon will not set up a pool and triggers 'SOURCE_IP not supported by OVN' errors in Octavia.

3) The magnum/openstack controller in k8s also triggers 'SOURCE_IP not supported by OVN' errors in Octavia but manages to set up all the elements; they cannot be edited. The pool LB method is blank and cannot be corrected.

Revision history for this message
Matt Verran (mv-2112) wrote (last edit ):

There are a number of magic steps needed to deploy Jenkins, so a simpler test was found (I forget where I found it, but will add credit if I find it again):

kubectl run hostname-server --image=lingxiankong/alpine-test --port=8080
kubectl expose pod hostname-server --type=LoadBalancer --target-port=8080 --port=80 --name hostname-server

This worked perfectly.

Ultimately I have found multiple other issues in deploying Jenkins and now have a working example.

Findings (updated from later investigation):-

1) Horizon as deployed in Sunbeam is not OVN-Octavia aware and so does not allow SOURCE_IP_PORT. I think this is a bug of its own.

2) The order of enabling does not appear to be relevant.

3) Istio load balancers have multiple listeners, which is possibly the differentiator?

Matt Verran (mv-2112)
Changed in snap-openstack:
status: New → Incomplete
Revision history for this message
Matt Verran (mv-2112) wrote :

So this is working for K8s api, etcd, and a helm deployed jenkins.

It is not working for Istio deployed via Helm. 10.0.0.17 is the Magnum node; 10.0.0.161 is the Magnum master.

2025-01-07T10:40:45.233Z [octavia-driver-agent] 2025-01-07 10:40:45.232 891 WARNING ovn_octavia_provider.helper [-] Member for event not found, info: {'ovn_lbs': [<ovsdbapp.backend.ovs_idl.rowview.RowView object at 0x704842ffe3f0>, <ovsdbapp.backend.ovs_idl.rowview.RowView object at 0x704842f18410>, <ovsdbapp.backend.ovs_idl.rowview.RowView object at 0x704842f1bb00>, <ovsdbapp.backend.ovs_idl.rowview.RowView object at 0x704842f1b2f0>, <ovsdbapp.backend.ovs_idl.rowview.RowView object at 0x704842f1a120>], 'ip': '10.0.0.161', 'port': '31922', 'status': ['offline']}
2025-01-07T10:40:45.233Z [octavia-driver-agent] 2025-01-07 10:40:45.233 891 DEBUG ovn_octavia_provider.helper [-] Updating status to octavia: {'loadbalancers': [{'id': '604f70d9-1157-47c5-a24d-bf266a0cc5b3', 'provisioning_status': 'ACTIVE', 'operating_status': 'DEGRADED'}], 'listeners': [{'id': '191f649a-3662-4a76-b7e9-a980e4d80f4b', 'provisioning_status': 'ACTIVE', 'operating_status': 'ONLINE'}, {'id': '79a77b2e-9050-43fd-bc52-1f52540f3815', 'provisioning_status': 'ACTIVE', 'operating_status': 'DEGRADED'}, {'id': 'fb7b148a-238b-4746-aec1-66975c7e5100', 'provisioning_status': 'ACTIVE', 'operating_status': 'ONLINE'}], 'pools': [{'id': 'ca44ff02-13f1-4343-a3ed-9d07c0a0946d', 'provisioning_status': 'ACTIVE', 'operating_status': 'ONLINE'}, {'id': 'c11ad91c-82d4-46a5-b83a-0a9d73c069f6', 'provisioning_status': 'ACTIVE', 'operating_status': 'DEGRADED'}, {'id': '5aee19ae-6fdd-4619-bd17-b359cd3cfcf3', 'provisioning_status': 'ACTIVE', 'operating_status': 'ONLINE'}], 'members': [{'id': '6ec135c4-2072-4706-9427-e269573b91ac', 'provisioning_status': 'ACTIVE', 'operating_status': 'ONLINE'}, {'id': '26ed7636-305c-47ac-99a1-d6f057af0b36', 'provisioning_status': 'ACTIVE', 'operating_status': 'ONLINE'}, {'id': 'c48c567f-88d9-4dbe-a959-3640ace37821', 'provisioning_status': 'ACTIVE', 'operating_status': 'ERROR'}, {'id': 'd697a2c7-7097-4650-8af8-419f5b391a8f', 'provisioning_status': 'ACTIVE', 'operating_status': 'ONLINE'}, {'id': 'd55783ca-3867-4f7e-a8c0-ce46ccbc1cda', 'provisioning_status': 'ACTIVE', 'operating_status': 'ONLINE'}, {'id': '622cbc32-d48d-42c3-89d7-4d8652114321', 'provisioning_status': 'ACTIVE', 'operating_status': 'ONLINE'}]} _update_status_to_octavia /usr/lib/python3/dist-packages/ovn_octavia_provider/helper.py:440
2025-01-07T10:40:45.284Z [octavia-driver-agent] 2025-01-07 10:40:45.283 1186 DEBUG oslo_db.sqlalchemy.engines [-] MySQL server mode set to STRICT_TRANS_TABLES,STRICT_ALL_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,TRADITIONAL,NO_ENGINE_SUBSTITUTION _check_effective_sql_mode /usr/lib/python3/dist-packages/oslo_db/sqlalchemy/engines.py:342
2025-01-07T10:40:45.624Z [octavia-driver-agent] 2025-01-07 10:40:45.624 891 DEBUG ovsdbapp.backend.ovs_idl.event [-] Matched UPDATE: ServiceMonitorUpdateEvent(events=('update', 'delete'), table='Service_Monitor', conditions=None, old_conditions=None), priority=20 to row=Service_Monitor(ip=10.0.0.17, src_mac=06:e1:d0:cc:fc:26, logical_port=a95b5516-4d37-4...


Revision history for this message
Matt Verran (mv-2112) wrote :

Perhaps the difference here is that the three Istio load balancers are 1 LB -> 3 listeners, each backed by a pool of 2 members.

Revision history for this message
Matt Verran (mv-2112) wrote (last edit ):

Dropping some additional info out of helper.py around here: https://opendev.org/openstack/ovn-octavia-provider/src/commit/e87aeed5d55568c065d4a317e801ce9ad99c1a84/ovn_octavia_provider/helper.py#L3261 shows that for a given health-monitor check on a pool with two members, 6 members are actually being returned.

Listing the K,V pairs as it runs gives this smoking gun...

Key: neutron:member_status, Value: {"ecdb6f9f-89b7-479c-b265-59d31669f0fb": "ONLINE", "f34ddd60-9b4a-4abb-be3b-4e3b2b522f3a": "ONLINE", "0dc10077-ed51-4ab5-9151-630c3968a7d3": "ERROR", "4006b2f8-5b7b-4bbc-9e89-38a34d76684f": "ERROR", "469c5816-db1b-446d-b4be-fbd9caf40079": "ERROR", "52060ae3-81e6-4d4a-bf47-18f710d063e4": "ERROR"}

Each of the above is a member, so it was either the members in the same LB, or members on the same port across the three Istio LBs. Tracking it back, these are all members on the same LB, Istio's ingress in this case. Since only the first two are on the ip/port combination tested on line 3276, it is failing the other two pairs even though they shouldn't be tested at this point.

Listener 1
0526a054-1106-4894-95ca-d5882b002434 - "ecdb6f9f-89b7-479c-b265-59d31669f0fb": "ONLINE", "f34ddd60-9b4a-4abb-be3b-4e3b2b522f3a": "ONLINE"

Listener 2 for the same LB but a different port/pool; since its members are not on that pool, they error
f0df1c3a-8fb3-430c-adf7-b9e70e116c54 - "0dc10077-ed51-4ab5-9151-630c3968a7d3": "ERROR", "4006b2f8-5b7b-4bbc-9e89-38a34d76684f": "ERROR"

Listener 3 for the same LB but a different port/pool; since its members are not on that pool, they error
dcb7169e-09fe-4b0f-974b-6a92d4882428 - "469c5816-db1b-446d-b4be-fbd9caf40079": "ERROR", "52060ae3-81e6-4d4a-bf47-18f710d063e4": "ERROR"

This is in my opinion a second Octavia OVN bug.
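The pool-scoping problem above can be illustrated with a small Python sketch. The member IDs (abbreviated) and their statuses are taken from the neutron:member_status value logged above; the dict shapes and variable names are assumptions for illustration, not the actual helper.py data structures:

```python
import json

# neutron:member_status as logged above (member IDs abbreviated)
member_status = json.loads("""{
    "ecdb6f9f": "ONLINE", "f34ddd60": "ONLINE",
    "0dc10077": "ERROR",  "4006b2f8": "ERROR",
    "469c5816": "ERROR",  "52060ae3": "ERROR"
}""")

# Only the first two members belong to the pool whose health-monitor
# event is being processed (listener 1's pool).
pool_members = {"ecdb6f9f", "f34ddd60"}

# Observed behaviour: every member on the LB is evaluated, so the other
# listeners' members drag the result into ERROR.
all_statuses = set(member_status.values())

# Pool-scoped behaviour: only this pool's members are considered.
pool_statuses = {member_status[m] for m in pool_members}

print(sorted(all_statuses))   # ['ERROR', 'ONLINE'] - ERROR leaks in
print(sorted(pool_statuses))  # ['ONLINE'] - this pool is actually healthy
```

If the check were scoped to the pool under test, the two ONLINE members would be the only ones evaluated and the pool would not be marked degraded.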

Revision history for this message
Matt Verran (mv-2112) wrote (last edit ):

@james-page - what's the best course of action here? These are not Sunbeam bugs, but if there are no immediate plans to move to Amphora-based Octavia, then these are issues that will be hit.

Bug 1

Horizon doesn't support SOURCE_IP_PORT in the GUI

Bug 2

Octavia health checks using the OVN provider fail if more than one listener is configured on the LB. (Use case: deploying Istio on top of a Magnum cluster.)

Matt Verran (mv-2112)
Changed in snap-openstack:
status: Incomplete → Opinion
summary: - [CaaS] magnum + loadbalancer does not work for ingress
+ Issues with OVN-based Octavia
Matt Verran (mv-2112)
Changed in snap-openstack:
status: Opinion → Incomplete
Revision history for this message
Matt Verran (mv-2112) wrote :

Applying the PPA from http://ppa.launchpadcontent.net/james-page/caracal did not resurrect an errored LB, nor did it allow a newly created LB to fall into this state.

Revision history for this message
Matt Verran (mv-2112) wrote :

The LBs with multiple listeners for different ports, as created by k8s (for Helm-deployed Istio), still drop into an error state.

The LB created by k8s with a single listener (for Helm-deployed Jenkins) still works fine as before.

Revision history for this message
Matt Verran (mv-2112) wrote :

I've performed another test on this...

Creating a scenario resembling the Magnum setup with simple VMs:

- create 6 VMs, install apache2
- edit /etc/apache2/ports.conf to "Listen 80", "Listen 8080", "Listen 9090" and restart. curl locally to check it's working.
- create a security group to allow 80, 8080, 9090
- create a load balancer with three listeners, on 80, 8080, and 9090.
- create three pools using SOURCE_IP_PORT: one for 80 on vm1 and vm2, another for 8080 on vm3 and vm4, and finally one for 9090 on vm5 and vm6.
- create health monitors for TCP.
- assign a floating IP to the load balancer.

Results at this point:-

- port 80 will work; 8080 and 9090 will time out.

Further testing

- transfer the pool for the port 80 listener to use members vm5 and vm6. Port 80 will still work; 8080 and 9090 still time out.
- delete the listener and pool for 80; it will still work. 8080 and 9090 still time out.
- delete the listener and pool for 8080; port 80 will still work. 8080 and 9090 still time out.
- delete vm5 and vm6 - port 80 STILL WORKS!!! 8080 and 9090 still time out.
- delete vm1 and vm2 (where port 80 was originally) and port 80 stops responding. 8080 and 9090 still time out.
- repoint the listener for 9090 to vm3 and vm4. Nothing responds.

Summary

It appears the LB is set up with the first listener and pool/members regardless, and no other traffic can work even if the health-check scenario seen on Magnum were resolved. For the use of Sunbeam and Magnum, the OVN provider doesn't appear to function as needed and contains showstopping issues.
