IPv6 mgmt network not working, Octavia can't talk to Amphora instance

Bug #1911788 reported by Hybrid512
Affects: OpenStack Octavia Charm
Status: Incomplete
Importance: Undecided
Assigned to: Unassigned

Bug Description

Hi,

I'm trying to make the Octavia charm work but I have run into a few issues.
My deployment appears to go well: everything is green, and I followed the Octavia charm guide to set it up automatically (i.e. auto-creating the Amphora image with diskimage retrofit and configuring the Octavia resources automatically through the charm).
Again, no errors; everything apparently goes well.

Once connected to Horizon, I spawn a few VMs and then try to create a new LB for them.
First, until an LB has been created, I constantly get a very annoying notification popup saying there is no LB available for listing.

Then I create an LB, filling in everything to get a very simple round-robin LB over 2 VMs on their private subnet.
Once completed, the LB is created but it stays stuck in "Offline/Pending Create" status until it ends up in an "Error" state, in which I can destroy it (it takes quite a few minutes before reaching that state).

While it was in the "Pending Create" state, I checked the Octavia unit logs and saw something like this:

2021-01-14 16:21:00.490 9938 WARNING octavia.amphorae.drivers.haproxy.rest_api_driver [-] Could not connect to instance. Retrying.: requests.exceptions.ConnectTimeout: HTTPSConnectionPool(host='fc00:fa21:3d5c:9cfd:f816:3eff:fef1:7fa1', port=9443): Max retries exceeded with url: // (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7f61a01f9f10>, 'Connection to fc00:fa21:3d5c:9cfd:f816:3eff:fef1:7fa1 timed out. (connect timeout=10.0)'))

I then checked the instance list and found that the Amphora instance had been created correctly (I checked the console and it is fully booted, apparently with no errors) and that it has the IPv6 address fc00:fa21:3d5c:9cfd:f816:3eff:fef1:7fa1 attached to it.

I then SSHed to the Octavia units and tried to ping6 this IP, and got no answer from any of the units, so there seems to be an issue with the IPv6 management overlay network.

I tried to curl https://[fc00:fa21:3d5c:9cfd:f816:3eff:fef1:7fa1]:9443 and got no answer either.
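For reference, the connectivity checks from an Octavia unit looked roughly like the following (a sketch; the health-manager port is usually plugged into the unit as o-hm0, but the interface name may differ on other deployments):

# run on an Octavia unit, e.g. via `juju ssh octavia/0`
ip -6 addr show                                        # confirm the unit has an address on the lb-mgmt IPv6 subnet (o-hm0)
ping6 -c 3 fc00:fa21:3d5c:9cfd:f816:3eff:fef1:7fa1     # no replies in my case
curl -k -m 10 https://[fc00:fa21:3d5c:9cfd:f816:3eff:fef1:7fa1]:9443/
# even a TLS/certificate error back from the amphora would prove L3 connectivity; a timeout means there is none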

Here is a detailed description of my setup:
* Charmed OpenStack (20.10 stable) deployed through MAAS (2.9.0) with Juju (2.8.7), with the Focal series for every unit (and OpenStack Ussuri, so "distro" openstack-origin)
* 7 machines (3 control planes, 4 compute nodes)
* HA mode for every control plane application, in conjunction with the HACluster charm and VIPs assigned
* 3 spaces:
  - "pxe": non-routed untagged network for provisioning only
  - "ost-int": routed /24 VLAN subnet for the OpenStack internal network
  - "ost-pub": routed /24 VLAN subnet for both the OpenStack admin and public networks

"ost-int" and "ost-pub" are routed, they can talk to each other, there is no firewall in between.

You can find my exported bundle attached.

I have tried to deploy this bundle dozens of times, with multiple spaces or only one space, but that didn't change anything; I was never able to get Octavia working properly.
The Amphora is created and the LB is created, but the Octavia API can't talk to the Amphora instance.

Frode Nordahl (fnordahl) wrote :

Thank you for the bug. To diagnose communication issues between the Octavia units and the cloud we must start by looking at the state of tunnels and port bindings.

Take a look at the Neutron ports created by the Octavia charm, for example `octavia-health-manager-octavia-0-listen-port`:
- Does the `binding_host_id` match the FQDN of the Octavia container?
- Does the `binding_vif_type` field say 'ovs' or does it say 'binding_failed'?
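A quick way to check both fields in one go (a sketch; the port name below matches the one created for unit octavia/0 and will differ for the other units):

openstack port show octavia-health-manager-octavia-0-listen-port \
    -c binding_host_id -c binding_vif_type
# binding_host_id should be the container FQDN, e.g. juju-xxxxxx-0-lxd-NN.maas
# binding_vif_type should read 'ovs'; 'binding_failed' means the port could not be bound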

Since you are using OVN, you will also have rich logging in /var/log/ovn/ovn-controller.log, which would show evidence of any shortname vs. FQDN issues.
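Something along these lines should surface how the chassis registered itself (the grep pattern is only a suggestion):

# on each Octavia container
sudo grep -iE 'chassis|hostname' /var/log/ovn/ovn-controller.log | tail -n 20
hostname -f    # compare with the name ovn-controller registered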

There is an ongoing issue with deploying OVS/OVN in LXD containers on MAAS, as detailed in bug 1896630. The root of the issue is somewhere below the charms, but the linked bug contains steps to work around it.
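A minimal way to check whether a container is affected, assuming shell access to the unit (the actual workaround steps are in the linked bug):

# inside the Octavia container
hostname -f                                      # should return the FQDN (e.g. juju-xxxxxx-N-lxd-NN.maas), not just the short name
sudo ovs-vsctl get Open_vSwitch . external_ids   # shows the hostname ovn-controller registered for this chassis, if set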

Changed in charm-octavia:
status: New → Incomplete
Hybrid512 (walid-moghrabi) wrote :

Hi,

I took a look at the bug report you gave me ... I don't have the same behavior but it might be related.

------------------------------------------------------------------
openstack port list | grep octavia
| 6fd3e411-0f55-4f57-a49b-8978ac7045be | octavia-health-manager-octavia-0-listen-port | fa:16:3e:4b:c6:48 | ip_address='fc00:bee5:427a:2b79:f816:3eff:fe4b:c648', subnet_id='e7e22722-af55-4b8d-b126-d5cf2e037c0d' | DOWN |
| 8878130c-c8a6-44f1-a668-d669b00a8e0d | octavia-health-manager-octavia-2-listen-port | fa:16:3e:bd:2c:83 | ip_address='fc00:bee5:427a:2b79:f816:3eff:febd:2c83', subnet_id='e7e22722-af55-4b8d-b126-d5cf2e037c0d' | DOWN |
| fdce70ed-b861-4473-a483-2024b2733c75 | octavia-health-manager-octavia-1-listen-port | fa:16:3e:a0:1d:a2 | ip_address='fc00:bee5:427a:2b79:f816:3eff:fea0:1da2', subnet_id='e7e22722-af55-4b8d-b126-d5cf2e037c0d' | DOWN |
------------------------------------------------------------------

------------------------------------------------------------------
openstack network agent list | grep lxd
| juju-37c2ba-2-lxd-16.maas | OVN Controller agent | juju-37c2ba-2-lxd-16.maas | | :-) | UP | ovn-controller |
| juju-37c2ba-0-lxd-18.maas | OVN Controller agent | juju-37c2ba-0-lxd-18.maas | | :-) | UP | ovn-controller |
| juju-37c2ba-1-lxd-17.maas | OVN Controller agent | juju-37c2ba-1-lxd-17.maas | | :-) | UP | ovn-controller |
------------------------------------------------------------------

------------------------------------------------------------------
openstack port show 6fd3e411-0f55-4f57-a49b-8978ac7045be
+-------------------------+------------------------+
| Field                   | Value                  |
+-------------------------+------------------------+
| admin_state_up          | UP                     |
| allowed_address_pairs   |                        |
| binding_host_id         | juju-37c2ba-2-lxd-16   |
| binding_profile         |                        |
| binding_vif_details     |                        |
| binding_vif_type        | binding_failed         |
| binding_vnic_type       | normal                 |
| created_at              | 2021-01-15T13:17:51Z   |
| data_plane_status       | None                   |
| description             |                        |
| device_id               |                        |
| device_owner            | neutron:LOADBALANCERV2 |
...

Frode Nordahl (fnordahl) wrote :

Thank you for providing more information. As you can see from the port details, you are hitting exactly the same issue:
| binding_host_id | juju-37c2ba-2-lxd-16

The binding_host_id should have been set to the FQDN of the Octavia unit, but as explained in the referenced bug that is not happening because the container cannot establish its FQDN at initial deploy or after subsequent reboots.
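A quick way to see the mismatch side by side (a sketch; run the first command on the affected unit and the second against the cloud API):

hostname -f                                   # expected by Neutron/OVN: juju-37c2ba-2-lxd-16.maas
openstack port show 6fd3e411-0f55-4f57-a49b-8978ac7045be -c binding_host_id -c binding_vif_type
# a short-name binding_host_id together with binding_vif_type=binding_failed matches bug 1896630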

I will mark this bug as a duplicate of the referenced bug.
