DHCP agent fails to fully configure DHCP namespaces because of duplicate address detected

Bug #1953165 reported by Pierre Riteau
This bug affects 11 people
Affects: neutron
Status: Fix Released
Importance: High
Assigned to: Bence Romsics

Bug Description

After upgrading a Neutron/ML2 OVS deployment from Ussuri to Victoria, updating the host OS from CentOS Linux 8 to CentOS Stream 8, and rebooting, DHCP was not functional on some but not all networks.

DHCP agent logs included the following error multiple times:

2021-11-30 17:05:35.475 7 ERROR neutron.agent.dhcp.agent [-] Failure waiting for address fe80::a9fe:a9fe to become ready: Duplicate address detected: neutron.agent.linux.ip_lib.AddressNotReady: Failure waiting for address fe80::a9fe:a9fe to become ready: Duplicate address detected
2021-11-30 17:05:35.475 7 ERROR neutron.agent.dhcp.agent Traceback (most recent call last):
2021-11-30 17:05:35.475 7 ERROR neutron.agent.dhcp.agent File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/common/utils.py", line 165, in call
2021-11-30 17:05:35.475 7 ERROR neutron.agent.dhcp.agent return func(*args, **kwargs)
2021-11-30 17:05:35.475 7 ERROR neutron.agent.dhcp.agent File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py", line 401, in safe_configure_dhcp_for_network
2021-11-30 17:05:35.475 7 ERROR neutron.agent.dhcp.agent self.configure_dhcp_for_network(network)
2021-11-30 17:05:35.475 7 ERROR neutron.agent.dhcp.agent File "/var/lib/kolla/venv/lib/python3.6/site-packages/osprofiler/profiler.py", line 160, in wrapper
2021-11-30 17:05:35.475 7 ERROR neutron.agent.dhcp.agent result = f(*args, **kwargs)
2021-11-30 17:05:35.475 7 ERROR neutron.agent.dhcp.agent File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py", line 415, in configure_dhcp_for_network
2021-11-30 17:05:35.475 7 ERROR neutron.agent.dhcp.agent self.update_isolated_metadata_proxy(network)
2021-11-30 17:05:35.475 7 ERROR neutron.agent.dhcp.agent File "/var/lib/kolla/venv/lib/python3.6/site-packages/osprofiler/profiler.py", line 160, in wrapper
2021-11-30 17:05:35.475 7 ERROR neutron.agent.dhcp.agent result = f(*args, **kwargs)
2021-11-30 17:05:35.475 7 ERROR neutron.agent.dhcp.agent File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py", line 758, in update_isolated_metadata_proxy
2021-11-30 17:05:35.475 7 ERROR neutron.agent.dhcp.agent self.enable_isolated_metadata_proxy(network)
2021-11-30 17:05:35.475 7 ERROR neutron.agent.dhcp.agent File "/var/lib/kolla/venv/lib/python3.6/site-packages/osprofiler/profiler.py", line 160, in wrapper
2021-11-30 17:05:35.475 7 ERROR neutron.agent.dhcp.agent result = f(*args, **kwargs)
2021-11-30 17:05:35.475 7 ERROR neutron.agent.dhcp.agent File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py", line 816, in enable_isolated_metadata_proxy
2021-11-30 17:05:35.475 7 ERROR neutron.agent.dhcp.agent self.conf, bind_address=constants.METADATA_V4_IP, **kwargs)
2021-11-30 17:05:35.475 7 ERROR neutron.agent.dhcp.agent File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/agent/metadata/driver.py", line 271, in spawn_monitored_metadata_proxy
2021-11-30 17:05:35.475 7 ERROR neutron.agent.dhcp.agent ).wait_until_address_ready(address=bind_address_v6)
2021-11-30 17:05:35.475 7 ERROR neutron.agent.dhcp.agent File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/agent/linux/ip_lib.py", line 597, in wait_until_address_ready
2021-11-30 17:05:35.475 7 ERROR neutron.agent.dhcp.agent exception=AddressNotReady(address=address, reason=errmsg))
2021-11-30 17:05:35.475 7 ERROR neutron.agent.dhcp.agent File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/common/utils.py", line 701, in wait_until_true
2021-11-30 17:05:35.475 7 ERROR neutron.agent.dhcp.agent while not predicate():
2021-11-30 17:05:35.475 7 ERROR neutron.agent.dhcp.agent File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/agent/linux/ip_lib.py", line 591, in is_address_ready
2021-11-30 17:05:35.475 7 ERROR neutron.agent.dhcp.agent address=address, reason=_('Duplicate address detected'))
2021-11-30 17:05:35.475 7 ERROR neutron.agent.dhcp.agent neutron.agent.linux.ip_lib.AddressNotReady: Failure waiting for address fe80::a9fe:a9fe to become ready: Duplicate address detected
2021-11-30 17:05:35.475 7 ERROR neutron.agent.dhcp.agent

The tap interface inside each affected qdhcp namespace was in a state like this:

35: tap0f8bb343-c1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether fa:16:3e:ed:6f:60 brd ff:ff:ff:ff:ff:ff
    inet 169.254.169.254/32 brd 169.254.169.254 scope global tap0f8bb343-c1
       valid_lft forever preferred_lft forever
    inet 10.18.0.10/16 brd 10.18.255.255 scope global tap0f8bb343-c1
       valid_lft forever preferred_lft forever
    inet6 fe80::a9fe:a9fe/64 scope link dadfailed tentative
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:feed:6f60/64 scope link
       valid_lft forever preferred_lft forever

Note the dadfailed status on the fe80::a9fe:a9fe/64 address, which caused Neutron to raise an AddressNotReady exception.

I tried restarting dhcp-agent multiple times. Occasionally DHCP for one network would configure correctly, but most of the time the list of affected networks stayed the same.

I found out that removing the fe80::a9fe:a9fe/64 address from the tap interface of each affected namespace, followed by restarting dhcp-agent, fixed the issue: no more dadfailed status.
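
For reference, here is an illustrative helper (not part of neutron; it assumes iproute2's JSON output exposes the dadfailed flag) to spot affected addresses in a namespace:

```python
import json
import subprocess

# Illustrative only: list (interface, address) pairs in dadfailed state
# inside a given network namespace, using `ip -j addr` JSON output.
def dadfailed_addrs(namespace):
    out = subprocess.run(
        ["ip", "netns", "exec", namespace, "ip", "-j", "addr"],
        check=True, capture_output=True, text=True).stdout
    bad = []
    for link in json.loads(out):
        for addr in link.get("addr_info", []):
            if addr.get("dadfailed"):  # flag name assumed from iproute2 JSON
                bad.append((link["ifname"], addr["local"]))
    return bad

# e.g. dadfailed_addrs("qdhcp-dee9459f-7ed8-4627-9c42-4006ec09d5fd")
```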

Version information:

* OpenStack Victoria deployed with Kolla source images
* neutron 17.2.2.dev70 (using stable/victoria from 2021-11-28)
* CentOS Stream release 8
* Linux kernel 4.18.0-348.2.1.el8_5.x86_64 #1 SMP Tue Nov 16 14:42:35 UTC 2021

Pierre Riteau (priteau)
summary: - DHCP agent fails to configure DHCP namespaces because of duplicate
+ DHCP agent fails to fully configure DHCP namespaces because of duplicate
address detected
Revision history for this message
Brian Haley (brian-haley) wrote :

So this happens only when you have more than one DHCP agent, correct? Using isolated subnets?

It looks like an oversight when we added support for metadata over IPv6, since using the same link-local address on multiple nodes will fail in DAD as you show above.

Just thinking out loud there might be a couple of options:

1) Neutron tells only one DHCP agent to configure the IPv6 metadata address. It reduces availability, and there might be some edge cases, but could work.

2) We change to use an Anycast address, in which case only one of the nodes will get the request. But this is more complicated because 1) Anycast addresses are only supposed to be configured on routers (which don't exist here); and 2) IANA assigns Anycast addresses, https://www.iana.org/assignments/ipv6-anycast-addresses/ipv6-anycast-addresses.xhtml

A quick fix for you would be to set this in neutron.conf:

dhcp_agents_per_network = 1

Changed in neutron:
importance: Undecided → High
status: New → Confirmed
tags: added: ipv6
Revision history for this message
Pierre Riteau (priteau) wrote :

This is actually on a deployment with three OpenStack controllers, each running dhcp-agent, but not in HA mode, so dhcp_agents_per_network is already at the default value of 1.

We also have a non-default setting: enable_isolated_metadata = true.

Most of the networks are tenant networks, but there are a few provider networks with Neutron DHCP enabled.

Revision history for this message
Brian Haley (brian-haley) wrote :

I know Bence is also looking at this, but one more question since I don't have a running devstack at the moment. You have enable_isolated_metadata set to True, but there is a router attached? If yes, does its namespace have this IPv6 address configured as well? It might be a moot point, but it would be good to know. Thanks.

Revision history for this message
Bence Romsics (bence-romsics) wrote :

I suspect this may be a duplicate of:
https://bugs.launchpad.net/neutron/+bug/1930414

Revision history for this message
Bence Romsics (bence-romsics) wrote (last edit ):

Brian: If my suspicion is right that this is a duplicate of that other bug, then this is not metadata specific. It is actually not even IPv6 specific, but only IPv6 performs DAD by default, which detects that dhcp ports may briefly leak traffic while they are plugged into the dead vlan. This could be confirmed by having multiple v6 subnets with the exact same range, so that neutron chooses the exact same address for the dhcp ports; these should go to dadfailed just as the metadata address does.

But regarding your metadata related question: only the dhcp namespace should have the metadata address configured. IIRC in the router namespace we do not have the metadata address configured. Instead we catch that traffic by an iptables redirect to the loopback address where haproxy listens.

Revision history for this message
Kamil Madac (kamil-madac) wrote :

We experienced the same bug last week, as I described on the mailing list http://lists.openstack.org/pipermail/openstack-discuss/2022-January/026484.html. This bug has severe consequences when the dadfailed state is not noticed by operators.

When the dhcp agent is restarted and there are dhcp namespaces with interfaces in dadfailed state, the NetworkCache in the dhcp agent is not updated with the subnets. As a result, a subsequent creation of a VM, or an update of a VM port, in such a network deletes the namespace completely, which then causes a connectivity outage for all VMs in that network.

I think we should fix this: if an exception is raised in the dhcp agent's configure_dhcp_for_network, inside the update_isolated_metadata_proxy() call, self.cache.put(network) should still be called in each case to ensure that the NetworkCache is updated correctly and the dhcp namespace won't be deleted in the next SyncState call (see the sketch after the quoted code below).

Here is the code from agent.py that I'm talking about:

    def configure_dhcp_for_network(self, network):
        if not network.admin_state_up:
            return

        for subnet in network.subnets:
            if subnet.enable_dhcp:
                if self.call_driver('enable', network):
                    self.update_isolated_metadata_proxy(network)
                    self.cache.put(network)
                    # After enabling dhcp for network, mark all existing
                    # ports as ready. So that the status of ports which are
                    # created before enabling dhcp can be updated.
                    self.dhcp_ready_ports |= {p.id for p in network.ports}
                break
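
A minimal sketch of that suggestion (this is not the merged fix, only an illustration of calling self.cache.put() even when the metadata proxy setup fails):

```python
    # Sketch only: cache the network even when the isolated metadata proxy
    # setup raises, so the next sync_state() does not remove the namespace.
    def configure_dhcp_for_network(self, network):
        if not network.admin_state_up:
            return

        for subnet in network.subnets:
            if subnet.enable_dhcp:
                if self.call_driver('enable', network):
                    try:
                        self.update_isolated_metadata_proxy(network)
                    finally:
                        # unlike the current code, cache unconditionally
                        self.cache.put(network)
                    self.dhcp_ready_ports |= {p.id for p in network.ports}
                break
```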

Revision history for this message
Mark Goddard (mgoddard) wrote :

I had the same symptoms as this issue, with slightly different steps to reproduce.

On an existing wallaby deployment using neutron ML2/OVS, create a VLAN network & IPv4 subnet. The DHCP agent logs show the same backtrace as in the original description. Restarting the DHCP agent shows the same backtrace each time.

While DHCP seems to work, metadata does not.

There are two other networks which do not exhibit this issue.

Worked around the issue as suggested:

ip netns  # find the ID of the affected network (qdhcp-<network-id>)
ip netns exec qdhcp-dee9459f-7ed8-4627-9c42-4006ec09d5fd bash
ip a | grep dadfailed  # confirm which address failed DAD
ip a del fe80::a9fe:a9fe/64 dev tapf0cd099d-aa

FWIW, the system is Ubuntu 20.04 based, and is deployed via kayobe/kolla-ansible.

Revision history for this message
Stig Telfer (stigtelfer) wrote :

I don't think this bug is a duplicate of #1930414, or at least the fix for #1930414 (https://opendev.org/openstack/neutron/commit/9d5cea0e2bb85b3b6ea27eb71279c57c419b0102) does not fix this issue.

I have reproduced the issue on a Wallaby OpenStack deployment which has the backported fix applied.

My workaround is to disable IPV6 on the controller nodes via sysctl (net.ipv6.conf.all.disable_ipv6: 1)

Revision history for this message
Maximilian Stinsky (mstinsky) wrote :

We just ran into this exact issue after upgrading our openstack environment to victoria.

As our installation is running version 17.4.1, I don't think that this is a duplicate of https://bugs.launchpad.net/neutron/+bug/1930414; as stigtelfer said, the fix for that bug is already applied in this version.

As we require ipv6 to be enabled we are at the moment evaluating a workaround to set the following sysctl values on our nodes that are hosting the dhcp namespaces:
net.core.devconf_inherit_init_net=1
net.ipv6.conf.default.accept_dad=0

As kamil-madac already mentioned this bug has very high impact as we are losing dhcp namespaces.

Revision history for this message
Maximilian Stinsky (mstinsky) wrote :

As expected, setting net.core.devconf_inherit_init_net=1 and net.ipv6.conf.default.accept_dad=0 and then letting the dhcp agent recreate all existing namespaces fixes the issue, and we are not losing any namespaces anymore.

Maybe it would be a good workaround for the neutron code to set accept_dad=0 on the tap interface when creating it inside the dhcp namespace, before adding the ipv6 metadata address?
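
A hypothetical illustration of that workaround (not neutron code; the namespace and device names are taken from the earlier comments):

```python
import subprocess

# Illustration only: disable IPv6 DAD on the tap device inside a qdhcp
# namespace, mirroring the accept_dad=0 sysctl workaround described above.
def disable_dad(namespace, device):
    subprocess.run(
        ["ip", "netns", "exec", namespace,
         "sysctl", "-w", "net.ipv6.conf.%s.accept_dad=0" % device],
        check=True)

# e.g. disable_dad("qdhcp-dee9459f-7ed8-4627-9c42-4006ec09d5fd",
#                  "tapf0cd099d-aa")
```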

Revision history for this message
Pierre Riteau (priteau) wrote :

We are still seeing this issue after upgrades to Xena and Yoga with the latest stable code: change I0391dd24224f8656a09ddb002e7dae8783ba37a4 (Make sure "dead vlan" ports cannot transmit packets) doesn't seem to help.

Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

We should not allow duplicated IPv4/IPv6 addresses. As commented by Bence, we are not currently providing metadata HA (we never did). Instead of allowing a configuration that could lead to potential errors, we should catch the error and not spawn the metadata server in the DHCP agent. At the same time, we should spawn a thread monitoring the interface. If the DHCP agent hosting the metadata server goes down, the others should try to configure the interface (metadata IPv4 or IPv6 addresses) and spawn the server (as proposed in [1]).

[1]https://bugs.launchpad.net/neutron/+bug/2007938/comments/4
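
A rough sketch of the monitoring idea (names and structure are illustrative assumptions, not a proposed patch):

```python
import threading
import time

# Sketch only: keep retrying the metadata address configuration in the
# background; once it binds cleanly (no DAD failure), spawn the proxy.
def monitor_metadata_address(try_configure_address, spawn_proxy, interval=30):
    def _loop():
        while True:
            if try_configure_address():  # True once DAD succeeds
                spawn_proxy()
                return
            time.sleep(interval)  # address still owned by another agent
    thread = threading.Thread(target=_loop, daemon=True)
    thread.start()
    return thread
```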

Revision history for this message
John Garbutt (johngarbutt) wrote :

Could we configure the metadata address as an anycast address? I don't really know the implications of that suggestion though.

FWIW, I am looking at patching this downstream to never configure the IPv6 address for metadata, until we all come up with something better.

Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello John:

You can also disable the "enable_isolated_metadata" flag (if possible, maybe you can't).

About the anycast address: we use the AWS metadata IPv4 address [1] (not the IPv6 one). We could make this change, but IMO it is a bit intrusive; I wouldn't go this way.

Regards.

[1]https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-metadata.html

Revision history for this message
Bence Romsics (bence-romsics) wrote :

Clearly my suspicion in comment #4 was wrong. But because of that, this issue fell off my radar. Sorry for that.

I'm trying to reproduce the side/follow-up effect mentioned by Kamil in comment #6 in my dev environment. Until I can do that, please tell me, for this side effect to happen:

1) How many dhcp-agents do you have?
2) What dhcp_agents_per_network setting do you use?

My current thinking goes:
If we have only one agent scheduled to a network, then we cannot have DAD failures.
If we have more than one agent scheduled to a network, then (at least initially) one will work okay and the rest will have DAD failures, and because of that will have the follow-up problems too.

But this means we should only lose the redundancy of dhcp, not the dhcp service itself. Of course I can imagine situations in which this leads to loss of dhcp later: for example, having two agents, initially one okay and one with the DAD failure; after turning off the okay one, the other cannot recover and provide a successful failover.

But I'd really like to understand in which ways (maybe multiple) we may end up losing all dhcp.

Regarding workarounds mentioned:

* enable_isolated_metadata=False should be safe if you can live with l3-agent's metadata
* net.ipv6.conf.all.disable_ipv6=1 should be safe if you can live without ipv6

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/876566

Revision history for this message
Bence Romsics (bence-romsics) wrote :

Until we come up with an actual fix, please feel free to experiment with the above (#18) stopgap workaround. It should at least get rid of the dhcp problems.

Changed in neutron:
assignee: nobody → Bence Romsics (bence-romsics)
Changed in neutron:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/876903

Revision history for this message
Brian Haley (brian-haley) wrote :

After some thinking and discussions on IRC, I pushed the above patch to try and address this problem. Still have to do some testing, and need to discuss further at our weekly Friday meeting.

Relevant comment from changelog:

   This helps fix the DAD failure bug in the following way.
    Using a ULA allows us to configure it on the loopback
    device, and inject a route into the VM via the DHCP
    agent IP address. This is what we do today for IPv4
    metadata on an isolated network. We cannot just move
    fe80::a9fe:a9fe to the loopback device and inject a
    route for it as it is a link-local address and traffic
    cannot be forwarded for it.

    In order to not break backwards compatibility, the
    initially chosen metadata address, fe80::a9fe:a9fe, is
    still configured in exactly the same (broken) way. We
    can eventually remove it when we update the cloud-init
    Openstack code and wait some amount of time.

Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello Brian:

* If the ULA address is set on the "lo" device, how will the VM be able to access this port?
* What address will the DHCP namespace's external interface (the one connecting the DHCP namespace with OVS) have?
* From the IRC conversation, if the DHCP agent goes down, the metadata server "assigned" to the VM will fail.
* What happens with the duplicated replies from different DHCP agents (as shown by rubasov)?

Regards.

Revision history for this message
Bence Romsics (bence-romsics) wrote :

Before I review Brian's new patch I just wanted to better understand how the ipv4 metadata works. Let me document my findings here:

I created a test environment with 3 hosts and dhcp_agents_per_network=2:

devstack0 - dhcp agent serving net1, dhcp port address: 10.0.4.2
devstack0a - dhcp agent serving net1, dhcp port address: 10.0.4.3
devstack0b - vm0 booted on net1

dhcp servers push the following routes:

devstack0 $ sudo ip netns exec qdhcp-$( openstack net show net1 -f value -c id ) cat /opt/stack/data/neutron/dhcp/$( openstack net show net1 -f value -c id )/opts
tag:subnet-b4f511be-e5de-46ab-b0fb-d6276797fd6c,option6:domain-search,openstacklocal
tag:subnet-dfa281b1-f0a3-4425-a972-45ce80c5f4d5,option:classless-static-route,169.254.169.254/32,10.0.4.2,0.0.0.0/0,10.0.4.1
tag:subnet-dfa281b1-f0a3-4425-a972-45ce80c5f4d5,249,169.254.169.254/32,10.0.4.2,0.0.0.0/0,10.0.4.1
tag:subnet-dfa281b1-f0a3-4425-a972-45ce80c5f4d5,option:router,10.0.4.1
tag:subnet-b4f511be-e5de-46ab-b0fb-d6276797fd6c,option6:dns-server,[2001:db8::2],[2001:db8::1]

devstack0a $ sudo ip netns exec qdhcp-$( openstack net show net1 -f value -c id ) cat /opt/stack/data/neutron/dhcp/$( openstack net show net1 -f value -c id )/opts
tag:subnet-b4f511be-e5de-46ab-b0fb-d6276797fd6c,option6:domain-search,openstacklocal
tag:subnet-dfa281b1-f0a3-4425-a972-45ce80c5f4d5,option:classless-static-route,169.254.169.254/32,10.0.4.3,0.0.0.0/0,10.0.4.1
tag:subnet-dfa281b1-f0a3-4425-a972-45ce80c5f4d5,249,169.254.169.254/32,10.0.4.3,0.0.0.0/0,10.0.4.1
tag:subnet-dfa281b1-f0a3-4425-a972-45ce80c5f4d5,option:router,10.0.4.1
tag:subnet-dfa281b1-f0a3-4425-a972-45ce80c5f4d5,option:dns-server,10.0.4.2,10.0.4.3
tag:subnet-b4f511be-e5de-46ab-b0fb-d6276797fd6c,option6:dns-server,[2001:db8::1],[2001:db8::2]

The freshly booted vm0 has this routing table:

$ ip r
default via 10.0.4.1 dev eth0
10.0.4.0/24 dev eth0 scope link src 10.0.4.185
169.254.169.254 via 10.0.4.3 dev eth0

The metadata address replies to ping:
$ ping -c3 169.254.169.254
PING 169.254.169.254 (169.254.169.254): 56 data bytes
64 bytes from 169.254.169.254: seq=0 ttl=64 time=1.980 ms
64 bytes from 169.254.169.254: seq=1 ttl=64 time=3.646 ms
64 bytes from 169.254.169.254: seq=2 ttl=64 time=1.778 ms

--- 169.254.169.254 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 1.778/2.468/3.646 ms

tcpdump in the dhcp namespaces on the dhcp port's tap interface confirms that traffic goes to devstack0a only - as expected.

Let's change the route in the guest:
$ ip r del 169.254.169.254
$ ip r add 169.254.169.254 via 10.0.4.2

# ping still works
$ ping -c 3 169.254.169.254
PING 169.254.169.254 (169.254.169.254): 56 data bytes
64 bytes from 169.254.169.254: seq=0 ttl=64 time=2.094 ms
64 bytes from 169.254.169.254: seq=1 ttl=64 time=2.048 ms
64 bytes from 169.254.169.254: seq=2 ttl=64 time=1.815 ms

--- 169.254.169.254 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 1.815/1.985/2.094 ms

tcpdump confirms that all traffic goes to devstack0.

NOTE: This contradicts the notion that Linux performs any duplicate address detection for IPv4 LL. It is hard t...


Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hi Bence:

You are right that this is not HA. But, as you mentioned, if the "assigned" metadata server fails, a new one can be reached by renewing the DHCP leases. That should be documented, and it works only for IPv4.

I'll review Brian's patch now.

Regards.

Revision history for this message
Brian Haley (brian-haley) wrote :

Hi Rodolfo,

Will try and answer your questions.

* If the ULA address is set on the "lo" device, how will the VM be able to access this port?

We would have to program a route to it, similar to what we do with IPv4 today, which the patch does. This would only be in the isolated case. As I mentioned in the patch, I think we could put the IPv4 metadata address on lo as well, but I would do that in another patch since it seems to work fine today.

* What address will the DHCP namespace's external interface (the one connecting the DHCP namespace with OVS) have?

The interface from the DHCP namespace to the subnet would just have a single IP - that of the DHCP service. Unless I misunderstood the question?

* From the IRC conversation, if the DHCP agent goes down, the metadata server "assigned" to the VM will fail.

Yes, the route to the metadata address via the DHCP IP will not work. This would be the same as for IPv4, as it is also a route in the isolated network case.

* What happens with the duplicated replies from different DHCP agents (as shown by rubasov)?

The VM will only choose a single DHCP server if multiple respond, so it will only install a single route to metadata. We would have to test that further with IPv6 to make sure.

Bence - thanks for the update, and glad on DHCP failover the "extra" 169.254 route was not present.

Revision history for this message
Brian Haley (brian-haley) wrote :

This bug seems to be causing gate instability in this fullstack job:

neutron.tests.fullstack.test_dhcp_agent.TestDhcpAgentHA.test_multiple_agents_for_network(Open vSwitch agent)

It seems like more than a coincidence that the multi-DHCP agent job is the one failing on random changes but hasn't failed on the change proposed by Bence. For that reason I'll add the gate-failure tag here, push a change based on the latest comments, and increase the review priority on the change. It should be enough to work around the problem until we can meet and talk about my additional change.

tags: added: gate-failure
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/877135

Revision history for this message
Brian Haley (brian-haley) wrote :

Removed gate-failure tag as further testing showed it doesn't solve the fullstack problem, at least in its current form.

tags: removed: gate-failure
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron-lib (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron-lib/+/880588

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron-lib (master)

Reviewed: https://review.opendev.org/c/openstack/neutron-lib/+/880588
Committed: https://opendev.org/openstack/neutron-lib/commit/a7b950cf8b5f322b7526b2b3eafb397b4f054c28
Submitter: "Zuul (22348)"
Branch: master

commit a7b950cf8b5f322b7526b2b3eafb397b4f054c28
Author: Bence Romsics <email address hidden>
Date: Mon Apr 17 09:32:18 2023 +0200

    FUP Suppress IPv6 metadata DAD failure and delete address

    In the partial fix
    https://review.opendev.org/c/openstack/neutron/+/876566 to bug #1953165,
    we changed the metadata cidr netmask from /64 to /128.

    This patch updates neutron-lib accordingly.

    Change-Id: Ib6807d3ccdcea4b440961f1fb7f212f9b982b2c5
    Related-Bug: #1953165

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/2023.1)

Fix proposed to branch: stable/2023.1
Review: https://review.opendev.org/c/openstack/neutron/+/880929

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/zed)

Fix proposed to branch: stable/zed
Review: https://review.opendev.org/c/openstack/neutron/+/880957

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/yoga)

Fix proposed to branch: stable/yoga
Review: https://review.opendev.org/c/openstack/neutron/+/880960

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/xena)

Fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/neutron/+/880964

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/wallaby)

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/neutron/+/880967

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/876566
Committed: https://opendev.org/openstack/neutron/commit/2aee961ab6942ab59aeacdc93d918c8c19023041
Submitter: "Zuul (22348)"
Branch: master

commit 2aee961ab6942ab59aeacdc93d918c8c19023041
Author: Bence Romsics <email address hidden>
Date: Mon Mar 6 13:04:01 2023 +0100

    Suppress IPv6 metadata DAD failure and delete address

    IPv4 DAD is non-existent in Linux or its failure is silent, so we
    never needed to catch and ignore it. On the other hand IPv6 DAD
    failure is explicit, hence comes this change.

    This of course leaves the metadata service dead on hosts where
    duplicate address detection failed. But if we catch the
    DADFailed exception and delete the address, at least other
    functions of the dhcp-agent should not be affected.

    With this the IPv6 isolated metadata service is not redundant, which
    is the best we can do without a redesign.

    Also document the promised service level of isolated metadata.

    Added additional tests for the metadata driver as well.

    Change-Id: I6b544c5528cb22e5e8846fc47dfb8b05f70f975c
    Partial-Bug: #1953165

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/victoria)

Fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/neutron/+/881703

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/2023.1)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/880929
Committed: https://opendev.org/openstack/neutron/commit/071255f098e0e73fd5220f83cbbc8ac1c421f3ab
Submitter: "Zuul (22348)"
Branch: stable/2023.1

commit 071255f098e0e73fd5220f83cbbc8ac1c421f3ab
Author: Bence Romsics <email address hidden>
Date: Mon Mar 6 13:04:01 2023 +0100

    Suppress IPv6 metadata DAD failure and delete address

    IPv4 DAD is non-existent in Linux or its failure is silent, so we
    never needed to catch and ignore it. On the other hand IPv6 DAD
    failure is explicit, hence comes this change.

    This of course leaves the metadata service dead on hosts where
    duplicate address detection failed. But if we catch the
    DADFailed exception and delete the address, at least other
    functions of the dhcp-agent should not be affected.

    With this the IPv6 isolated metadata service is not redundant, which
    is the best we can do without a redesign.

    Also document the promised service level of isolated metadata.

    Added additional tests for the metadata driver as well.

    Conflicts:
        neutron/conf/agent/database/agentschedulers_db.py
            conflict with 831ac3152dc

    Change-Id: I6b544c5528cb22e5e8846fc47dfb8b05f70f975c
    Partial-Bug: #1953165
    (cherry picked from commit 2aee961ab6942ab59aeacdc93d918c8c19023041)

Revision history for this message
Florian Engelmann (engelmann) wrote :

Hi,

I am testing this patch with stable/zed. It looks like the patch does delete those "dadfailed" addresses, but it does not start/restart the haproxy process in an affected netns.

for i in $(ip netns ls | awk '/qdhcp-/ { print $1}'); do PIDS="$(ip netns pids $i | xargs ps co command=)"; if [[ ! "$PIDS" =~ "haproxy" ]]; then echo $i; fi;done | sort
qdhcp-0614c965-c47d-47b4-bae2-acf24b191605
qdhcp-0b62624a-4be2-479a-9fe6-40d7fc5c9b83
qdhcp-0fb9e437-9d62-4144-9d5d-b2b062680b89
qdhcp-211261df-4c82-4d30-b460-7296e555758f
qdhcp-79c41e38-5a10-4de3-83e9-bd69c8f97092
qdhcp-a88e1f6d-df26-49c7-b562-52354decb3d2
qdhcp-c32bbe2e-bbdf-4ad7-87ff-1ab05881a3e5
qdhcp-e81fa39a-79af-4028-a5dd-df7cbc6ad762

cat /var/lib/docker/volumes/kolla_logs/_data/neutron/neutron-dhcp-agent.log | awk '/DAD failed/ { print $17}' | sort | uniq
qdhcp-0614c965-c47d-47b4-bae2-acf24b191605
qdhcp-0b62624a-4be2-479a-9fe6-40d7fc5c9b83
qdhcp-0fb9e437-9d62-4144-9d5d-b2b062680b89
qdhcp-211261df-4c82-4d30-b460-7296e555758f
qdhcp-79c41e38-5a10-4de3-83e9-bd69c8f97092
qdhcp-a88e1f6d-df26-49c7-b562-52354decb3d2
qdhcp-c32bbe2e-bbdf-4ad7-87ff-1ab05881a3e5
qdhcp-e81fa39a-79af-4028-a5dd-df7cbc6ad762

So every qdhcp netns with a dadfailed interface does not have its haproxy started.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/zed)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/880957
Committed: https://opendev.org/openstack/neutron/commit/1c615281f7632f3f1cf4bd37eefe90c50c6dfe25
Submitter: "Zuul (22348)"
Branch: stable/zed

commit 1c615281f7632f3f1cf4bd37eefe90c50c6dfe25
Author: Bence Romsics <email address hidden>
Date: Mon Mar 6 13:04:01 2023 +0100

    Suppress IPv6 metadata DAD failure and delete address

    IPv4 DAD is non-existent in Linux or its failure is silent, so we
    never needed to catch and ignore it. On the other hand IPv6 DAD
    failure is explicit, hence comes this change.

    This of course leaves the metadata service dead on hosts where
    duplicate address detection failed. But if we catch the
    DADFailed exception and delete the address, at least other
    functions of the dhcp-agent should not be affected.

    With this the IPv6 isolated metadata service is not redundant, which
    is the best we can do without a redesign.

    Also document the promised service level of isolated metadata.

    Added additional tests for the metadata driver as well.

    Conflicts:
        neutron/tests/unit/agent/linux/test_dhcp.py
            conflict with 74224e79e031636018b970fac9c2aa72516eb12d
        neutron/tests/unit/agent/metadata/test_driver.py
            conflict with 3d575f8bd066ce2eb46353a49a8c6850ba9e4387

    Change-Id: I6b544c5528cb22e5e8846fc47dfb8b05f70f975c
    Partial-Bug: #1953165
    (cherry picked from commit 2aee961ab6942ab59aeacdc93d918c8c19023041)
    (cherry picked from commit 071255f098e0e73fd5220f83cbbc8ac1c421f3ab)

tags: added: in-stable-zed
tags: added: in-stable-yoga
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/yoga)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/880960
Committed: https://opendev.org/openstack/neutron/commit/defb6018f3a395094cc85a03b93a2a0b43d2f6ff
Submitter: "Zuul (22348)"
Branch: stable/yoga

commit defb6018f3a395094cc85a03b93a2a0b43d2f6ff
Author: Bence Romsics <email address hidden>
Date: Mon Mar 6 13:04:01 2023 +0100

    Suppress IPv6 metadata DAD failure and delete address

    IPv4 DAD is non-existent in Linux or its failure is silent, so we
    never needed to catch and ignore it. On the other hand IPv6 DAD
    failure is explicit, hence comes this change.

    This of course leaves the metadata service dead on hosts where
    duplicate address detection failed. But if we catch the
    DADFailed exception and delete the address, at least other
    functions of the dhcp-agent should not be affected.

    With this the IPv6 isolated metadata service is not redundant, which
    is the best we can do without a redesign.

    Also document the promised service level of isolated metadata.

    Added additional tests for the metadata driver as well.

    Change-Id: I6b544c5528cb22e5e8846fc47dfb8b05f70f975c
    Partial-Bug: #1953165
    (cherry picked from commit 2aee961ab6942ab59aeacdc93d918c8c19023041)
    (cherry picked from commit 071255f098e0e73fd5220f83cbbc8ac1c421f3ab)
    (cherry picked from commit 1c615281f7632f3f1cf4bd37eefe90c50c6dfe25)

tags: added: in-stable-xena
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/880964
Committed: https://opendev.org/openstack/neutron/commit/1d674825ebbe5fcab6c8fef7d03b5cf9b332b743
Submitter: "Zuul (22348)"
Branch: stable/xena

commit 1d674825ebbe5fcab6c8fef7d03b5cf9b332b743
Author: Bence Romsics <email address hidden>
Date: Mon Mar 6 13:04:01 2023 +0100

    Suppress IPv6 metadata DAD failure and delete address

    IPv4 DAD is non-existent in Linux or its failure is silent, so we
    never needed to catch and ignore it. On the other hand IPv6 DAD
    failure is explicit, hence comes this change.

    This of course leaves the metadata service dead on hosts where
    duplicate address detection failed. But if we catch the
    DADFailed exception and delete the address, at least other
    functions of the dhcp-agent should not be affected.

    With this the IPv6 isolated metadata service is not redundant, which
    is the best we can do without a redesign.

    Also document the promised service level of isolated metadata.

    Added additional tests for the metadata driver as well.

    Conflicts:
        neutron/tests/unit/agent/metadata/test_driver.py
            conflict with f430cd00725f8303f5313cb7784c9aed4b982e62

    Change-Id: I6b544c5528cb22e5e8846fc47dfb8b05f70f975c
    Partial-Bug: #1953165
    (cherry picked from commit 2aee961ab6942ab59aeacdc93d918c8c19023041)
    (cherry picked from commit 071255f098e0e73fd5220f83cbbc8ac1c421f3ab)
    (cherry picked from commit 1c615281f7632f3f1cf4bd37eefe90c50c6dfe25)
    (cherry picked from commit defb6018f3a395094cc85a03b93a2a0b43d2f6ff)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/880967
Committed: https://opendev.org/openstack/neutron/commit/f53cff4a9c57bb39db8baf3f4a41ade085af98b4
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit f53cff4a9c57bb39db8baf3f4a41ade085af98b4
Author: Bence Romsics <email address hidden>
Date: Mon Mar 6 13:04:01 2023 +0100

    Suppress IPv6 metadata DAD failure and delete address

    IPv4 DAD is non-existent in Linux or its failure is silent, so we
    never needed to catch and ignore it. On the other hand IPv6 DAD
    failure is explicit, hence comes this change.

    This of course leaves the metadata service dead on hosts where
    duplicate address detection failed. But if we catch the
    DADFailed exception and delete the address, at least other
    functions of the dhcp-agent should not be affected.

    With this the IPv6 isolated metadata service is not redundant, which
    is the best we can do without a redesign.

    Also document the promised service level of isolated metadata.

    Added additional tests for the metadata driver as well.

    Change-Id: I6b544c5528cb22e5e8846fc47dfb8b05f70f975c
    Partial-Bug: #1953165
    (cherry picked from commit 2aee961ab6942ab59aeacdc93d918c8c19023041)
    (cherry picked from commit 071255f098e0e73fd5220f83cbbc8ac1c421f3ab)
    (cherry picked from commit 1c615281f7632f3f1cf4bd37eefe90c50c6dfe25)
    (cherry picked from commit defb6018f3a395094cc85a03b93a2a0b43d2f6ff)
    (cherry picked from commit 1d674825ebbe5fcab6c8fef7d03b5cf9b332b743)

tags: added: in-stable-wallaby
tags: added: in-stable-victoria
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/victoria)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/881703
Committed: https://opendev.org/openstack/neutron/commit/080770cd7b0331e708d54970cdda5fb6b3bc1b20
Submitter: "Zuul (22348)"
Branch: stable/victoria

commit 080770cd7b0331e708d54970cdda5fb6b3bc1b20
Author: Bence Romsics <email address hidden>
Date: Mon Mar 6 13:04:01 2023 +0100

    Suppress IPv6 metadata DAD failure and delete address

    IPv4 DAD is non-existent in Linux or its failure is silent, so we
    never needed to catch and ignore it. On the other hand IPv6 DAD
    failure is explicit, hence comes this change.

    This of course leaves the metadata service dead on hosts where
    duplicate address detection failed. But if we catch the
    DADFailed exception and delete the address, at least other
    functions of the dhcp-agent should not be affected.

    With this the IPv6 isolated metadata service is not redundant, which
    is the best we can do without a redesign.

    Also document the promised service level of isolated metadata.

    Added additional tests for the metadata driver as well.

    Change-Id: I6b544c5528cb22e5e8846fc47dfb8b05f70f975c
    Partial-Bug: #1953165
    (cherry picked from commit 2aee961ab6942ab59aeacdc93d918c8c19023041)
    (cherry picked from commit 071255f098e0e73fd5220f83cbbc8ac1c421f3ab)
    (cherry picked from commit 1c615281f7632f3f1cf4bd37eefe90c50c6dfe25)
    (cherry picked from commit defb6018f3a395094cc85a03b93a2a0b43d2f6ff)
    (cherry picked from commit 1d674825ebbe5fcab6c8fef7d03b5cf9b332b743)
    (cherry picked from commit f53cff4a9c57bb39db8baf3f4a41ade085af98b4)

Revision history for this message
Brian Haley (brian-haley) wrote :

Hi Florian. I think I understand the problem. I think we decided not to start haproxy on systems where IPv6 DAD failed. But that means the IPv4 metadata service will not work either? Is that the issue you are seeing?

This one line is probably causing that.

https://review.opendev.org/c/openstack/neutron/+/876566/13/neutron/agent/metadata/driver.py#268

Will have to see if Bence has an opinion on that, but we might have just missed it, assuming there was always more than one agent running.

Revision history for this message
Bence Romsics (bence-romsics) wrote :

Hi Florian and Brian,

> So every qdhcp netns with a dadfailed interface does not have its haproxy started.

I believe this to be true with the code we merged. However, what is the consequence of it? As Brian asked: is the ipv4 metadata service not working?

We may not have made the best choice to put that return there, because it ties the fate of the ipv4 metadata service (in that ns) to the fate of the ipv6 LL DAD. But there's no guarantee these are in sync.

Anyway, before making a followup change, I believe we should understand exactly what error remains with the current code.

Revision history for this message
Brian Haley (brian-haley) wrote :

Agreed Bence, I actually have a patch that removes that 'return'; it seems like an oversight, maybe because we were always testing in a multi-agent setup?

Will be at least next week before I can look closer unless Florian responds it helps.

Revision history for this message
Florian Engelmann (engelmann) wrote :

Hi,

I am still testing. Will keep you updated!

Question:
What about "transparent" haproxy binding?

https://docs.haproxy.org/2.4/configuration.html#5.1-transparent

This option should allow us to spawn haproxy even if the IPv6 address (dadfailed) was deleted, so online VMs with outdated DHCP information are still able to query metadata via IPv4.

What do you think?

Revision history for this message
Brian Haley (brian-haley) wrote :

Ok, thanks for testing.

If you want to create a patch implementing something with transparent proxy I'd review it, since I'm not sure how exactly it would work in this case at the moment.

Revision history for this message
Florian Engelmann (engelmann) wrote :

Ok I will try to create that patch on top of yours.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/883193

Revision history for this message
Brian Haley (brian-haley) wrote :

I proposed that change ^^^ to see if it helps with the issue, but marked WIP so it will not merge (yet).

Revision history for this message
Florian Engelmann (engelmann) wrote :

Thank you Brian,

your latest WIP patch (https://review.opendev.org/c/openstack/neutron/+/883193) together with the "transparent" binding option does allow haproxy to spawn in each DHCP netns even if dadfailed NICs get deleted.

for i in $(ip netns ls | awk '/qdhcp-/ { print $1 }'); do if [ $(ip netns pids $i | wc -l | grep 2) ]; then echo $i; fi ; done | sort
qdhcp-0614c965-c47d-47b4-bae2-acf24b191605
qdhcp-0b62624a-4be2-479a-9fe6-40d7fc5c9b83
qdhcp-0fb9e437-9d62-4144-9d5d-b2b062680b89
qdhcp-211261df-4c82-4d30-b460-7296e555758f
qdhcp-25aadba7-44bf-41bf-aaff-015a581e2d21
qdhcp-6810dff4-b8c0-40c3-a755-62b5ebaf8369
qdhcp-6bd9cbec-4b84-4adc-ac9c-901cb90a8eec
qdhcp-79c41e38-5a10-4de3-83e9-bd69c8f97092
qdhcp-8c15292e-9335-438b-b166-0a3cffd5f233
qdhcp-a88e1f6d-df26-49c7-b562-52354decb3d2
qdhcp-c32bbe2e-bbdf-4ad7-87ff-1ab05881a3e5
qdhcp-cd696a35-adc7-4a8e-aa81-fdabd0ceafde
qdhcp-e346b34b-2e47-4221-888a-4d1880cb34d8
qdhcp-e81fa39a-79af-4028-a5dd-df7cbc6ad762

docker exec -ti neutron_dhcp_agent ls /var/lib/neutron/ns-metadata-proxy/ | sort
0614c965-c47d-47b4-bae2-acf24b191605.conf
0b62624a-4be2-479a-9fe6-40d7fc5c9b83.conf
0fb9e437-9d62-4144-9d5d-b2b062680b89.conf
211261df-4c82-4d30-b460-7296e555758f.conf
25aadba7-44bf-41bf-aaff-015a581e2d21.conf
6810dff4-b8c0-40c3-a755-62b5ebaf8369.conf
6bd9cbec-4b84-4adc-ac9c-901cb90a8eec.conf
79c41e38-5a10-4de3-83e9-bd69c8f97092.conf
8c15292e-9335-438b-b166-0a3cffd5f233.conf
a88e1f6d-df26-49c7-b562-52354decb3d2.conf
c32bbe2e-bbdf-4ad7-87ff-1ab05881a3e5.conf
cd696a35-adc7-4a8e-aa81-fdabd0ceafde.conf
e346b34b-2e47-4221-888a-4d1880cb34d8.conf
e81fa39a-79af-4028-a5dd-df7cbc6ad762.conf

I will try to create a complete patch asap.

Revision history for this message
Florian Engelmann (engelmann) wrote :

without the transparent binding option it looks like follows:

docker exec -ti neutron_dhcp_agent ls /var/lib/neutron/ns-metadata-proxy/ | sort
0614c965-c47d-47b4-bae2-acf24b191605.conf
0b62624a-4be2-479a-9fe6-40d7fc5c9b83.conf
0fb9e437-9d62-4144-9d5d-b2b062680b89.conf
211261df-4c82-4d30-b460-7296e555758f.conf
25aadba7-44bf-41bf-aaff-015a581e2d21.conf
6810dff4-b8c0-40c3-a755-62b5ebaf8369.conf
6bd9cbec-4b84-4adc-ac9c-901cb90a8eec.conf
79c41e38-5a10-4de3-83e9-bd69c8f97092.conf
8c15292e-9335-438b-b166-0a3cffd5f233.conf
a88e1f6d-df26-49c7-b562-52354decb3d2.conf
c32bbe2e-bbdf-4ad7-87ff-1ab05881a3e5.conf
cd696a35-adc7-4a8e-aa81-fdabd0ceafde.conf
e346b34b-2e47-4221-888a-4d1880cb34d8.conf
e81fa39a-79af-4028-a5dd-df7cbc6ad762.conf

for i in $(ip netns ls | awk '/qdhcp-/ { print $1 }'); do if [ $(ip netns pids $i | wc -l | grep 2) ]; then echo $i; fi ; done | sort
qdhcp-25aadba7-44bf-41bf-aaff-015a581e2d21
qdhcp-6810dff4-b8c0-40c3-a755-62b5ebaf8369
qdhcp-6bd9cbec-4b84-4adc-ac9c-901cb90a8eec
qdhcp-8c15292e-9335-438b-b166-0a3cffd5f233
qdhcp-cd696a35-adc7-4a8e-aa81-fdabd0ceafde
qdhcp-e346b34b-2e47-4221-888a-4d1880cb34d8

So haproxy fails to start/bind in some of the netns:

2023-05-17 18:11:41.378 7 ERROR neutron.agent.linux.utils [-] Exit code: 1; Cmd: ['ip', 'netns', 'exec', 'qdhcp-e81fa39a-79af-4028-a5dd-df7cbc6ad762', 'haproxy', '-f', '/var/lib/neutron/ns-metadata-proxy/e81fa39a-79af-4028-a5dd-df7cbc6ad762.conf']; Stdin: ; Stdout: ; Stderr: [NOTICE] (708) : haproxy version is 2.4.22-0ubuntu0.22.04.1
[NOTICE] (708) : path to executable is /usr/sbin/haproxy
[ALERT] (708) : Starting proxy listener: cannot bind socket (Cannot assign requested address) [fe80::a9fe:a9fe:80]
[ALERT] (708) : [haproxy.main()] Some protocols failed to start their listeners! Exiting.

Revision history for this message
Florian Engelmann (engelmann) wrote :

diff --git a/neutron/agent/metadata/driver.py b/neutron/agent/metadata/driver.py
index 0b23879354..bbc5569e29 100644
--- a/neutron/agent/metadata/driver.py
+++ b/neutron/agent/metadata/driver.py
@@ -144,7 +144,7 @@ class HaproxyConfigurator(object):
         }
         if self.host_v6 and self.bind_interface:
             cfg_info['bind_v6_line'] = (
-                'bind %s:%s interface %s' % (
+                'bind %s:%s interface %s transparent' % (
                     self.host_v6, self.port, self.bind_interface)
             )
         # If using the network ID, delete any spurious router ID that might

Revision history for this message
Bence Romsics (bence-romsics) wrote :

Hi Florian and Brian,

Sorry for the slow reaction! And thanks for all the testing.

I'm wondering if we could get rid of the "cannot bind socket" error without the use of haproxy transparent binding. I did not try it yet, but it seems to me that with a small refactoring that should be possible. Basically we would need to move this:

https://opendev.org/openstack/neutron/src/commit/181177fe885f49f2d31f3e175d33efbf21ac3676/neutron/agent/metadata/driver.py#L243-L249

to between lines 277-278, where we would already know the outcome of DAD and, based on that, could control _get_metadata_proxy_callback(bind_address_v6=...).

This way we would not depend on haproxy compile options.
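
A minimal, runnable sketch of that control flow (all names are stand-ins, not the actual neutron API):

```python
# Sketch only: decide on the v6 bind address after DAD has completed, so
# the haproxy config is built without a v6 bind line when DAD failed.
class DADFailed(Exception):
    """Stand-in for the DAD failure raised by ip_lib."""

def wait_until_address_ready(address, device):
    # stub that simulates a DAD failure; real code polls the kernel for
    # the tentative/dadfailed flags
    raise DADFailed(address)

def delete_address(address, device):
    print("deleting %s from %s" % (address, device))

def spawn_metadata_proxy(bind_address_v6=None):
    v6_line = ("bind %s:80" % bind_address_v6 if bind_address_v6
               else "(no v6 bind line)")
    print("haproxy config:", v6_line)

def configure(bind_address_v6, device):
    if bind_address_v6:
        try:
            wait_until_address_ready(bind_address_v6, device)
        except DADFailed:
            delete_address(bind_address_v6, device)
            bind_address_v6 = None  # build the config without a v6 bind
    spawn_metadata_proxy(bind_address_v6=bind_address_v6)

configure("fe80::a9fe:a9fe", "tap0f8bb343-c1")
```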

What do you think?

Revision history for this message
Brian Haley (brian-haley) wrote :

Hi Bence,

Yes, it looks like that should work as well. If you want I can update my small change, https://review.opendev.org/c/openstack/neutron/+/883193 to move that code?

Revision history for this message
Brian Haley (brian-haley) wrote :

I updated my patch based on this.

Revision history for this message
Florian Engelmann (engelmann) wrote :

Hi Bence and Brian,

I will test the new patch asap.
Just a small note: it is not a compile option regarding haproxy. It is just a configuration parameter.

Please see the patch of the haproxy.conf template inside the neutron driver code:

diff --git a/neutron/agent/metadata/driver.py b/neutron/agent/metadata/driver.py
index 0b23879354..bbc5569e29 100644
--- a/neutron/agent/metadata/driver.py
+++ b/neutron/agent/metadata/driver.py
@@ -144,7 +144,7 @@ class HaproxyConfigurator(object):
         }
         if self.host_v6 and self.bind_interface:
             cfg_info['bind_v6_line'] = (
-                'bind %s:%s interface %s' % (
+                'bind %s:%s interface %s transparent' % (
                     self.host_v6, self.port, self.bind_interface)
             )
         # If using the network ID, delete any spurious router ID that might

Revision history for this message
Florian Engelmann (engelmann) wrote :

Hi Bence and Brian,

I tested the updated patch (https://review.opendev.org/c/openstack/neutron/+/883193) and it looks like all haproxies are spawned:

docker exec -ti neutron_dhcp_agent ls /var/lib/neutron/ns-metadata-proxy/ | sort | wc -l
40

for i in $(ip netns ls | awk '/qdhcp-/ { print $1 }'); do if [ $(ip netns pids $i | wc -l | grep 2) ]; then echo $i; fi ; done | sort | wc -l
40

nice!

All the best,
Florian

Revision history for this message
Brian Haley (brian-haley) wrote :

Thanks for the testing Florian.

For now I think I'll just not add the 'transparent' keyword since it seems to work.

I still have to send an update to retry adding the IPv6 link-local, for example, when the dhcp agent hosting it becomes unavailable.

Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Adding https://bugs.launchpad.net/neutron/+bug/2022321 to this bug because of the detailed description of the issue detected.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/2023.1)

Related fix proposed to branch: stable/2023.1
Review: https://review.opendev.org/c/openstack/neutron/+/885249

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/zed)

Related fix proposed to branch: stable/zed
Review: https://review.opendev.org/c/openstack/neutron/+/885270

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/xena)

Related fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/neutron/+/885271

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/yoga)

Related fix proposed to branch: stable/yoga
Review: https://review.opendev.org/c/openstack/neutron/+/885272

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/wallaby)

Related fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/neutron/+/885273

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/victoria)

Related fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/neutron/+/885274

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/883193
Committed: https://opendev.org/openstack/neutron/commit/846003c4379124de6ffb3628ef1feb12a62a9cfa
Submitter: "Zuul (22348)"
Branch: master

commit 846003c4379124de6ffb3628ef1feb12a62a9cfa
Author: Brian Haley <email address hidden>
Date: Mon May 15 12:29:42 2023 -0400

    Start metadata proxy even if IPv6 DAD fails

    A recent change suppressed the IPv6 DAD failure and
    removed the address when multiple DHCP agents were
    configured on the same network,
    https://review.opendev.org/c/openstack/neutron/+/880957

    But it also changed the behavior to not enable IPv4
    metadata in this case. Restore the old behavior by
    not returning early in the DAD failure case. The callback
    that builds the config file was moved until after
    the address was bound to make the two steps more obvious.

    Related-bug: #1953165
    Change-Id: I8436c6c9da9a2533ca27ff7312f5b2c7ea41e94f

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/2023.1)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/885249
Committed: https://opendev.org/openstack/neutron/commit/e7f85abae6a46a115582b80d1909e3565d859e9b
Submitter: "Zuul (22348)"
Branch: stable/2023.1

commit e7f85abae6a46a115582b80d1909e3565d859e9b
Author: Brian Haley <email address hidden>
Date: Mon May 15 12:29:42 2023 -0400

    Start metadata proxy even if IPv6 DAD fails

    A recent change suppressed the IPv6 DAD failure and
    removed the address when multiple DHCP agents were
    configured on the same network,
    https://review.opendev.org/c/openstack/neutron/+/880957

    But it also changed the behavior to not enable IPv4
    metadata in this case. Restore the old behavior by
    not returning early in the DAD failure case. The callback
    that builds the config file was moved until after
    the address was bound to make the two steps more obvious.

    Conflicts:
        neutron/tests/unit/agent/metadata/test_driver.py

    Related-bug: #1953165
    Change-Id: I8436c6c9da9a2533ca27ff7312f5b2c7ea41e94f
    (cherry picked from commit 846003c4379124de6ffb3628ef1feb12a62a9cfa)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/zed)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/885270
Committed: https://opendev.org/openstack/neutron/commit/1a711f399abebff6572551ef4e3f7b92397caab5
Submitter: "Zuul (22348)"
Branch: stable/zed

commit 1a711f399abebff6572551ef4e3f7b92397caab5
Author: Brian Haley <email address hidden>
Date: Mon May 15 12:29:42 2023 -0400

    Start metadata proxy even if IPv6 DAD fails

    A recent change suppressed the IPv6 DAD failure and
    removed the address when multiple DHCP agents were
    configured on the same network,
    https://review.opendev.org/c/openstack/neutron/+/880957

    But it also changed the behavior to not enable IPv4
    metadata in this case. Restore the old behavior by
    not returning early in the DAD failure case. The callback
    that builds the config file was moved until after
    the address was bound to make the two steps more obvious.

    Conflicts:
        neutron/tests/unit/agent/metadata/test_driver.py

    Related-bug: #1953165
    Change-Id: I8436c6c9da9a2533ca27ff7312f5b2c7ea41e94f
    (cherry picked from commit 846003c4379124de6ffb3628ef1feb12a62a9cfa)
    (cherry picked from commit e7f85abae6a46a115582b80d1909e3565d859e9b)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/yoga)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/885272
Committed: https://opendev.org/openstack/neutron/commit/b37a2f80ee11715643929e1af809aa9edd814ed0
Submitter: "Zuul (22348)"
Branch: stable/yoga

commit b37a2f80ee11715643929e1af809aa9edd814ed0
Author: Brian Haley <email address hidden>
Date: Mon May 15 12:29:42 2023 -0400

    Start metadata proxy even if IPv6 DAD fails

    A recent change suppressed the IPv6 DAD failure and
    removed the address when multiple DHCP agents were
    configured on the same network,
    https://review.opendev.org/c/openstack/neutron/+/880957

    But it also changed the behavior to not enable IPv4
    metadata in this case. Restore the old behavior by
    not returning early in the DAD failure case. The callback
    that builds the config file was moved until after
    the address was bound to make the two steps more obvious.

    Conflicts:
        neutron/tests/unit/agent/metadata/test_driver.py

    Related-bug: #1953165
    Change-Id: I8436c6c9da9a2533ca27ff7312f5b2c7ea41e94f
    (cherry picked from commit 846003c4379124de6ffb3628ef1feb12a62a9cfa)
    (cherry picked from commit e7f85abae6a46a115582b80d1909e3565d859e9b)
    (cherry picked from commit 1a711f399abebff6572551ef4e3f7b92397caab5)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/885271
Committed: https://opendev.org/openstack/neutron/commit/14b239b28048ffbb11322c9ed03ee5c22d84edbd
Submitter: "Zuul (22348)"
Branch: stable/xena

commit 14b239b28048ffbb11322c9ed03ee5c22d84edbd
Author: Brian Haley <email address hidden>
Date: Mon May 15 12:29:42 2023 -0400

    Start metadata proxy even if IPv6 DAD fails

    A recent change suppressed the IPv6 DAD failure and
    removed the address when multiple DHCP agents were
    configured on the same network,
    https://review.opendev.org/c/openstack/neutron/+/880957

    But it also changed the behavior to not enable IPv4
    metadata in this case. Restore the old behavior by
    not returning early in the DAD failure case. The callback
    that builds the config file was moved until after
    the address was bound to make the two steps more obvious.

    Conflicts:
        neutron/tests/unit/agent/metadata/test_driver.py

    Related-bug: #1953165
    Change-Id: I8436c6c9da9a2533ca27ff7312f5b2c7ea41e94f
    (cherry picked from commit 846003c4379124de6ffb3628ef1feb12a62a9cfa)
    (cherry picked from commit e7f85abae6a46a115582b80d1909e3565d859e9b)
    (cherry picked from commit 1a711f399abebff6572551ef4e3f7b92397caab5)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/885273
Committed: https://opendev.org/openstack/neutron/commit/495ef9f37c0d87c7ca11308c347e46f481f19eab
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 495ef9f37c0d87c7ca11308c347e46f481f19eab
Author: Brian Haley <email address hidden>
Date: Mon May 15 12:29:42 2023 -0400

    Start metadata proxy even if IPv6 DAD fails

    A recent change suppressed the IPv6 DAD failure and
    removed the address when multiple DHCP agents were
    configured on the same network,
    https://review.opendev.org/c/openstack/neutron/+/880957

    But it also changed the behavior to not enable IPv4
    metadata in this case. Restore the old behavior by
    not returning early in the DAD failure case. The callback
    that builds the config file was moved until after
    the address was bound to make the two steps more obvious.

    Conflicts:
        neutron/tests/unit/agent/metadata/test_driver.py

    Related-bug: #1953165
    Change-Id: I8436c6c9da9a2533ca27ff7312f5b2c7ea41e94f
    (cherry picked from commit 846003c4379124de6ffb3628ef1feb12a62a9cfa)
    (cherry picked from commit e7f85abae6a46a115582b80d1909e3565d859e9b)
    (cherry picked from commit 1a711f399abebff6572551ef4e3f7b92397caab5)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/victoria)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/885274
Committed: https://opendev.org/openstack/neutron/commit/31abe8bd1282467bbbfe5b6fea37b1f2ae559919
Submitter: "Zuul (22348)"
Branch: stable/victoria

commit 31abe8bd1282467bbbfe5b6fea37b1f2ae559919
Author: Brian Haley <email address hidden>
Date: Mon May 15 12:29:42 2023 -0400

    Start metadata proxy even if IPv6 DAD fails

    A recent change suppressed the IPv6 DAD failure and
    removed the address when multiple DHCP agents were
    configured on the same network,
    https://review.opendev.org/c/openstack/neutron/+/880957

    But it also changed the behavior to not enable IPv4
    metadata in this case. Restore the old behavior by
    not returning early in the DAD failure case. The callback
    that builds the config file was moved until after
    the address was bound to make the two steps more obvious.

    Conflicts:
        neutron/tests/unit/agent/metadata/test_driver.py

    Related-bug: #1953165
    Change-Id: I8436c6c9da9a2533ca27ff7312f5b2c7ea41e94f
    (cherry picked from commit 846003c4379124de6ffb3628ef1feb12a62a9cfa)
    (cherry picked from commit e7f85abae6a46a115582b80d1909e3565d859e9b)
    (cherry picked from commit 1a711f399abebff6572551ef4e3f7b92397caab5)

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
Mohammed Naser (mnaser) wrote :

I'm also finding the following in the logs:

```
2023-07-11 22:45:59.727 39 INFO neutron.agent.metadata.driver [-] DAD failed for address fe80::a9fe:a9fe on interface tapb7e3a48f-56 in namespace qdhcp-85b9a333-dd2a-4354-b09c-74e35138596b on network 85b9a333-dd2a-4354-b09c-74e35138596b, deleting it. Exception: Failure waiting for address fe80::a9fe:a9fe to become ready: Duplicate address detected
```
