Cloud-Init cannot contact Meta-Data-Service on Xena with OVN

Bug #1949097 reported by Eugen Mayer
This bug affects 1 person
Affects: neutron
Status: Invalid
Importance: Undecided
Assigned to: Unassigned

Bug Description

## Brief
When running Xena with OVN, neither Debian nor CirrOS cloud-init bootstraps can reach the metadata service during boot.

- Using Wallaby+OVN works.
- Using Xena+OVS works.

This suggests a regression in Xena when OVN is used.

## Pre-Conditions:
- Xena
- OVN
- CirrOS instance booted with cloud-init

Other environment details:
- Non-DVR (not sure whether this matters, but it is what was used)
- Debian 11 hosts (not sure whether this matters, but it is what was used)
- Ubuntu-based Docker images (not sure whether this matters, but it is what was used)

Reproduction:
You can use https://github.com/EugenMayer/openstack-lab/tree/stable/ovn and run 'make start' to set up a Vagrant environment. Vanilla Kolla is used to deploy, based on the kolla 'stable/xena' branches.

After the deployment, run the API requests described in https://github.com/EugenMayer/openstack-lab/blob/stable/ovn/README.setup.md

The booted CirrOS instance should already show the errors.

The exact same stack can be tested with Wallaby using https://github.com/EugenMayer/openstack-lab/blob/stable/ovn-wallaby, where it works. The only things that changed are the inventory and Ansible: https://github.com/EugenMayer/openstack-lab/compare/stable/ovn...stable/ovn-wallaby?expand=1

## Expected output:
The CirrOS instance can reach the metadata service, as in these boot logs:

--------
WARN: failed: route add -net "0.0.0.0/0" gw "10.10.0.1"
OK
checking http://169.254.169.254/2009-04-04/instance-id
successful after 1/20 tries: up 9.89. iid=i-00000002
failed to get http://169.254.169.254/2009-04-04/user-data
warning: no ec2 metadata for user-data
--------

## Actual Output
--------
route: SIOCADDRT: File exists
WARN: failed: route add -net "0.0.0.0/0" gw "10.10.0.1"
OK
checking http://169.254.169.254/2009-04-04/instance-id
failed 1/20: up 1.34. request failed
failed 2/20: up 50.36. request failed
failed 3/20: up 99.38. request failed
--------

## Version

Hard for me to tell, since the releases are not tagged yet.
Whatever https://github.com/openstack/kolla-ansible/tree/stable/xena deploys.

## Severity

Blocks us from using Xena entirely.

## More Information

Interface tcpdump and namespace information: https://gist.github.com/EugenMayer/3b7d1fc4a42d7fc911229f38eec891dd

In the metadata agent, I can see this configuration.
On the compute instance, I see these two running: https://gist.github.com/EugenMayer/b6611b9725a7697d0a392c1b3c1a5683

Debug logs: https://gist.github.com/EugenMayer/e2c5796f7c547c224a3aaccad2c35257
Normal logs: https://gist.github.com/EugenMayer/bbda954d207fc03b1fd9ea0bb6d143cf

https://gist.github.com/EugenMayer/fda894a3c6bb2bde5b9d1756b234cd79

Tags: ovn
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello Eugen:

We need more information to try to debug this issue. The metadata agent creates a namespace in the compute node when a VM is spawned on a network.

This namespace is named "ovnmeta-<neutron_network_id>". Please check that you have something similar to https://paste.opendev.org/show/810261/. The tap device IP must match the network subnet CIDR.

You should also have a haproxy instance running with a configuration file named like "...ovn-metadata-proxy/<neutron_network_id>.conf".

When the VM is started, you should be able to dump the traffic on the namespace TAP device going to 169.254.169.254/32. You should see something like https://paste.opendev.org/show/810262/.

The namespace should be deleted when no VM belonging to this network is present in this compute node.

Regards.
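
The namespace check described above can be scripted. This is a minimal sketch, assuming the output of `ip netns list` on the compute node has been captured as text (one namespace per line, optionally followed by an "(id: N)" suffix). The function names and the sample network ID are hypothetical and only serve as illustration.

```python
def ovnmeta_namespace(network_id):
    """Build the namespace name the comment describes: ovnmeta-<neutron_network_id>."""
    return "ovnmeta-" + network_id


def namespaces_from_ip_netns(output):
    """Extract namespace names from captured `ip netns list` output.

    Each non-empty line starts with the namespace name; anything after
    the first whitespace (such as "(id: 0)") is ignored.
    """
    names = []
    for line in output.splitlines():
        line = line.strip()
        if line:
            names.append(line.split()[0])
    return names


def has_metadata_namespace(output, network_id):
    """Return True if the expected ovnmeta namespace appears in the output."""
    return ovnmeta_namespace(network_id) in namespaces_from_ip_netns(output)


# Hypothetical network ID and captured output, for illustration only.
sample_network_id = "11111111-2222-3333-4444-555555555555"
sample_output = "ovnmeta-11111111-2222-3333-4444-555555555555 (id: 0)\n"
print(has_metadata_namespace(sample_output, sample_network_id))
```

If the function returns False while a VM on that network is running on the node, the metadata agent logs are the next place to look.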

Eugen Mayer (eugenmayer)
description: updated
Eugen Mayer (eugenmayer) wrote :

As requested, I added the namespace / interface information and the tcpdump under "More Information" (https://gist.github.com/EugenMayer/3b7d1fc4a42d7fc911229f38eec891dd).

description: updated
tags: added: ovn
Slawek Kaplonski (slaweq) wrote :

@Eugen: did you check whether the haproxy process is running in that namespace? What is in the neutron-ovn-metadata-agent logs? What is in the haproxy logs regarding those requests?

Eugen Mayer (eugenmayer) wrote :

After comparing a working and a broken setup, I found that in the working setup the booted instance has the route

169.254.169.254 via 10.10.0.2 dev eth0

The broken instance does not have this route. As a result, requests to the haproxy / metadata service at 169.254.169.254 time out, because the broken instance only has the default route:

0.0.0.0 via 10.10.0.1 dev eth0

The question is: what pushes this route?
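
To illustrate why the missing host route matters, here is a minimal sketch of longest-prefix route selection applied to the two routing tables described above. It assumes the guest's routes are available as text in the form quoted in this comment; the function names are hypothetical, and the parsing only handles the "DEST via GATEWAY dev IFACE" shape.

```python
import ipaddress


def parse_routes(text):
    """Parse lines like 'DEST via GATEWAY dev IFACE' into (network, gateway) pairs."""
    routes = []
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        dest = tokens[0]
        if dest in ("default", "0.0.0.0"):
            network = ipaddress.ip_network("0.0.0.0/0")
        else:
            # A bare address such as 169.254.169.254 becomes a /32 host route.
            network = ipaddress.ip_network(dest, strict=False)
        gateway = tokens[tokens.index("via") + 1] if "via" in tokens else None
        routes.append((network, gateway))
    return routes


def next_hop(text, address):
    """Return the gateway of the most specific route that covers the address."""
    addr = ipaddress.ip_address(address)
    matches = [(net, gw) for net, gw in parse_routes(text) if addr in net]
    if not matches:
        raise LookupError("no route to " + address)
    return max(matches, key=lambda route: route[0].prefixlen)[1]


WORKING = "0.0.0.0 via 10.10.0.1 dev eth0\n169.254.169.254 via 10.10.0.2 dev eth0"
BROKEN = "0.0.0.0 via 10.10.0.1 dev eth0"

print(next_hop(WORKING, "169.254.169.254"))  # 10.10.0.2
print(next_hop(BROKEN, "169.254.169.254"))   # 10.10.0.1
```

In the working table the metadata address is sent to 10.10.0.2; in the broken table it falls through to the default gateway 10.10.0.1, which matches the timeouts seen in the boot log.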

Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello:

Routes are pushed by the DHCP server.

I'll mark this bug as invalid as this is not a problem in the OVS metadata service or OVN DHCP service.

Regards.
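
For readers unfamiliar with how a DHCP server pushes routes: the usual mechanism is the classless static route option (option 121, RFC 3442). The sketch below encodes the host route from the working setup in that wire format. It is an assumption that option 121 is the mechanism involved in this deployment, the thread does not say so, and the function name is hypothetical.

```python
import ipaddress


def encode_classless_static_routes(routes):
    """Encode (destination CIDR, router) pairs as an RFC 3442 option payload.

    Each route is: one byte of prefix length, then only the significant
    octets of the destination, then the four octets of the router address.
    """
    payload = bytearray()
    for destination, router in routes:
        network = ipaddress.ip_network(destination)
        significant_octets = (network.prefixlen + 7) // 8
        payload.append(network.prefixlen)
        payload += network.network_address.packed[:significant_octets]
        payload += ipaddress.ip_address(router).packed
    return bytes(payload)


# Host route to the metadata address via 10.10.0.2, as seen in the working setup.
option_121 = encode_classless_static_routes([("169.254.169.254/32", "10.10.0.2")])
print(option_121.hex())  # 20a9fea9fe0a0a0002
```

A guest whose DHCP client does not receive or does not apply such an option ends up with only the default route, which is the state described for the broken instance.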

Changed in neutron:
status: New → Invalid