[ovn] metadata route missing on the guest

Bug #1959098 reported by Przemyslaw Lal
Affects           Status     Importance  Assigned to  Milestone
neutron           New        Undecided   Unassigned
neutron (Ubuntu)  Confirmed  Undecided   Unassigned

Bug Description

* High level description

The metadata server (169.254.169.254) is unreachable on VMs attached to one specific network; it is the only affected network in the entire cluster. DHCP is enabled on that subnet and VMs get their IP addresses on boot, but the routing rule for the metadata endpoint is missing:
$ ip r
default via 10.134.253.1 dev eth0
10.134.253.0/24 dev eth0 scope link src 10.134.253.181

Because of that, cloud-init metadata requests are sent to the router rather than to the ovnmeta netns.

On guests running in an unaffected network, the routing table after booting or sending a DHCP request looks like this and the metadata endpoint is reachable:
$ ip r
default via 172.16.2.1 dev eth0
169.254.169.254 via 172.16.2.10 dev eth0
172.16.2.0/24 dev eth0 scope link src 172.16.2.248

I managed to work around this by manually adding a route to the metadata IP via the DHCP port on the router attached to that network (see the sketch below); however, I believe this should not be needed, and such configuration is definitely not present on the "good" networks in this cluster.
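
For reference, the workaround adds an extra static route on the router attached to that network, pointing the metadata IP at the subnet's .2 (DHCP/metadata) port; roughly something like this, with the router name being a placeholder:

$ openstack router set --route destination=169.254.169.254/32,gateway=10.134.253.2 <router-name>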

Please let me know what logs and other information would be useful here.

* Step-by-step reproduction steps

1) Create a VM attached to the affected network.
2) The metadata server is unreachable and cloud-init fails because the metadata route is not provided by the DHCP server (a quick in-guest check is shown below).
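
A quick way to confirm the failure from inside the guest (assuming the standard EC2-style metadata path):

$ ip r | grep 169.254.169.254 || echo "metadata route missing"
$ curl -m 5 http://169.254.169.254/latest/meta-data/instance-id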

* Expected output

I'd expect the metadata route to be present on the guest:

$ ip r
default via 10.134.253.1 dev eth0
169.254.169.254 via 10.134.253.2 dev eth0
10.134.253.0/24 dev eth0 scope link src 10.134.253.181

* Actual output

$ ip r
default via 10.134.253.1 dev eth0
10.134.253.0/24 dev eth0 scope link src 10.134.253.181

* Versions
neutron-common 2:16.4.1-0ubuntu2
neutron-ovn-metadata-agent 2:16.4.1-0ubuntu2
python3-neutron 2:16.4.1-0ubuntu2
python3-neutron-lib 2.3.0-0ubuntu1
python3-neutronclient 1:7.1.1-0ubuntu1
ovn-common 20.03.2-0ubuntu0.20.04.1
ovn-host 20.03.2-0ubuntu0.20.04.1
openvswitch-common 2.13.3-0ubuntu0.20.04.2
openvswitch-switch 2.13.3-0ubuntu0.20.04.2
python3-openvswitch 2.13.3-0ubuntu0.20.04.2
python3-ovsdbapp 1.1.0-0ubuntu2

Host OS: Ubuntu 20.04.3 LTS
Kernel: 5.8.0-48-generic #54~20.04.1-Ubuntu
Deployment: Juju charms

Guest OS: cirros 0.5.2 and Ubuntu 20.04, so most likely all distros are affected

* Environment

42 compute nodes, nova-compute 21.2.2-0ubuntu1 + libvirt 6.0.0-0ubuntu8.14 + KVM.
Deployed with Juju charms.

* Perceived severity

Not a blocker since there is a workaround.

Tags: ovn
Revision history for this message
yatin (yatinkarel) wrote :

Hi Przemysław Lal,

What do you mean by "affected" network? Do you mean there are multiple networks in your setup, and out of those only one is misbehaving in terms of routes?

If the above is true, has the "affected" network been misbehaving since it was created, or did it work earlier and stop working later? What other differences are there between the affected and unaffected networks/subnets?

The following information would be good to collect for the affected network, after removing the workarounds:
- openstack network show <network id>
- openstack subnet show <subnet id>
- openstack port list --device-owner network:distributed --network <affected network>
- ovn-nbctl find DHCP_Options external_ids:subnet_id=<subnet-id>

And also the same info for an unaffected network.

Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hi Przemysław:

Apart from what Yatin requested, which is necessary to debug this issue, can you confirm "ovn_metadata_enabled" is True? I guess it is, because other subnets have this port, but it doesn't hurt to double-check.

Just in case, please check if you have [1] and [2] in your code. If you changed the subnet parameters ("dhcp_enable") without those patches, you could be in an undefined state now.

Can you also check in the server logs whether the metadata port was created during network creation? [3]

Regards.

[1] https://review.opendev.org/q/I05394e49077a72199bbc80c8cb622ec2b17f2fa7
[2] https://review.opendev.org/q/I09cc14dff6933aae63cbd43a29f9221f405ecede
[3] https://github.com/openstack/neutron/blob/e7b70521d0e230143a80974e7e4795a2acafcc9b/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_client.py#L1753

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in neutron (Ubuntu):
status: New → Confirmed
Revision history for this message
Jose Guedez (jfguedez) wrote :

We hit this issue today as well. Same symptoms:

* Failed to get metadata during VM launch, consistently and only on the "affected" network. Other networks, like the "unaffected" one, are OK.
* Missing metadata route inside VM
* After adding the route manually to the .2 IP we can ping/curl the metadata endpoint with no issues, so it seems the route is the only thing missing (see the example after this list).
* The workaround of adding the metadata route explicitly to the relevant router allows new VMs in the affected network to get metadata without problems.
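
For reference, the manual in-guest fix is just the single route via the subnet's .2 port (the gateway below is a placeholder for the affected subnet's metadata/DHCP port address):

$ sudo ip route add 169.254.169.254/32 via <subnet-.2-address> dev eth0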

These are the current packages:

ii neutron-common 2:16.4.2-0ubuntu1
ii neutron-ovn-metadata-agent 2:16.4.2-0ubuntu1
ii python3-neutron 2:16.4.2-0ubuntu1
ii python3-neutron-lib 2.3.0-0ubuntu1
ii python3-neutronclient 1:7.1.1-0ubuntu1

I am attaching the information requested above for an "affected" and "unaffected" network. The main difference I see is that the "unaffected" subnet has the following option in the ovn-nb that is missing from the "affected" subnet:

classless_static_route="{169.254.169.254/32,10.131.83.2, 0.0.0.0/0,10.131.83.1}"
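
For anyone comparing the two subnets, that option can be read straight from the OVN NB database, e.g. (the subnet ID is a placeholder):

$ ovn-nbctl --bare --columns=options find DHCP_Options external_ids:subnet_id=<subnet-id>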

The two patches you mention are indeed included in python3-neutron 2:16.4.2-0ubuntu1. I additionally confirmed by checking /usr/lib/python3/dist-packages/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_client.py

Regarding "ovn_metadata_enabled", I didn't find it set to "true" in any config under /etc/neutron. I can only see the default commented out and no mention in neutron_ovn_metadata_agent.ini, which has the ovs/ovn config in it (but I am no expert)

/etc/neutron# grep -r ovn_metadata
ovn.ini:#ovn_metadata_enabled = false

The creation logs are no longer available. The ports for the .2 IPs are created in the subnet, and they do have a device_id of ovnmeta-<networkid>, but the device_owner is network:dhcp and not network:distributed as you seem to be expecting (the commands used for this check are shown below). I added the output of `port show` for them as well. Note that other networks on the same compute nodes have no issues providing metadata, including the "unaffected" network (data attached).
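
For anyone checking the same thing, the metadata ports can be listed under either owner, e.g. (the network name is a placeholder):

$ openstack port list --device-owner network:dhcp --network <affected network>
$ openstack port list --device-owner network:distributed --network <affected network>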

Revision history for this message
Jose Guedez (jfguedez) wrote :

unaffected network/subnet information

Revision history for this message
Jose Guedez (jfguedez) wrote :

affected network/subnet information

Revision history for this message
Jose Guedez (jfguedez) wrote :

Actually, I can confirm ovn_metadata_enabled is set to "True". I was looking in the wrong place (the compute/metadata agent node); it is set on the API server node:

/etc/neutron# grep -r ovn_metadata
plugins/ml2/ml2_conf.ini:ovn_metadata_enabled = True
ovn.ini:#ovn_metadata_enabled = false

Revision history for this message
Max Khon (fjoe) wrote (last edit ):

Looks like you are using Ussuri.

In the case of OVN the route is not supposed to be used: https://docs.openstack.org/networking-ovn/latest/contributor/design/metadata_api.html

Instead, OVN does all the proxying at the OVN layer (see the link above).
