DHCP agent fails to fully configure DHCP namespaces because of duplicate address detected

Bug #1953165 reported by Pierre Riteau
Affects: neutron
Status: Confirmed
Importance: High
Assigned to: Unassigned

Bug Description

After upgrading a Neutron/ML2 OVS deployment from Ussuri to Victoria, updating the host OS from CentOS Linux 8 to CentOS Stream 8, and rebooting, DHCP was not functional on some but not all networks.

DHCP agent logs included the following error multiple times:

2021-11-30 17:05:35.475 7 ERROR neutron.agent.dhcp.agent [-] Failure waiting for address fe80::a9fe:a9fe to become ready: Duplicate address detected: neutron.agent.linux.ip_lib.AddressNotReady: Failure waiting for address fe80::a9fe:a9fe to become ready: Duplicate address detected
2021-11-30 17:05:35.475 7 ERROR neutron.agent.dhcp.agent Traceback (most recent call last):
2021-11-30 17:05:35.475 7 ERROR neutron.agent.dhcp.agent File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/common/utils.py", line 165, in call
2021-11-30 17:05:35.475 7 ERROR neutron.agent.dhcp.agent return func(*args, **kwargs)
2021-11-30 17:05:35.475 7 ERROR neutron.agent.dhcp.agent File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py", line 401, in safe_configure_dhcp_for_network
2021-11-30 17:05:35.475 7 ERROR neutron.agent.dhcp.agent self.configure_dhcp_for_network(network)
2021-11-30 17:05:35.475 7 ERROR neutron.agent.dhcp.agent File "/var/lib/kolla/venv/lib/python3.6/site-packages/osprofiler/profiler.py", line 160, in wrapper
2021-11-30 17:05:35.475 7 ERROR neutron.agent.dhcp.agent result = f(*args, **kwargs)
2021-11-30 17:05:35.475 7 ERROR neutron.agent.dhcp.agent File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py", line 415, in configure_dhcp_for_network
2021-11-30 17:05:35.475 7 ERROR neutron.agent.dhcp.agent self.update_isolated_metadata_proxy(network)
2021-11-30 17:05:35.475 7 ERROR neutron.agent.dhcp.agent File "/var/lib/kolla/venv/lib/python3.6/site-packages/osprofiler/profiler.py", line 160, in wrapper
2021-11-30 17:05:35.475 7 ERROR neutron.agent.dhcp.agent result = f(*args, **kwargs)
2021-11-30 17:05:35.475 7 ERROR neutron.agent.dhcp.agent File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py", line 758, in update_isolated_metadata_proxy
2021-11-30 17:05:35.475 7 ERROR neutron.agent.dhcp.agent self.enable_isolated_metadata_proxy(network)
2021-11-30 17:05:35.475 7 ERROR neutron.agent.dhcp.agent File "/var/lib/kolla/venv/lib/python3.6/site-packages/osprofiler/profiler.py", line 160, in wrapper
2021-11-30 17:05:35.475 7 ERROR neutron.agent.dhcp.agent result = f(*args, **kwargs)
2021-11-30 17:05:35.475 7 ERROR neutron.agent.dhcp.agent File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py", line 816, in enable_isolated_metadata_proxy
2021-11-30 17:05:35.475 7 ERROR neutron.agent.dhcp.agent self.conf, bind_address=constants.METADATA_V4_IP, **kwargs)
2021-11-30 17:05:35.475 7 ERROR neutron.agent.dhcp.agent File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/agent/metadata/driver.py", line 271, in spawn_monitored_metadata_proxy
2021-11-30 17:05:35.475 7 ERROR neutron.agent.dhcp.agent ).wait_until_address_ready(address=bind_address_v6)
2021-11-30 17:05:35.475 7 ERROR neutron.agent.dhcp.agent File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/agent/linux/ip_lib.py", line 597, in wait_until_address_ready
2021-11-30 17:05:35.475 7 ERROR neutron.agent.dhcp.agent exception=AddressNotReady(address=address, reason=errmsg))
2021-11-30 17:05:35.475 7 ERROR neutron.agent.dhcp.agent File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/common/utils.py", line 701, in wait_until_true
2021-11-30 17:05:35.475 7 ERROR neutron.agent.dhcp.agent while not predicate():
2021-11-30 17:05:35.475 7 ERROR neutron.agent.dhcp.agent File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/agent/linux/ip_lib.py", line 591, in is_address_ready
2021-11-30 17:05:35.475 7 ERROR neutron.agent.dhcp.agent address=address, reason=_('Duplicate address detected'))
2021-11-30 17:05:35.475 7 ERROR neutron.agent.dhcp.agent neutron.agent.linux.ip_lib.AddressNotReady: Failure waiting for address fe80::a9fe:a9fe to become ready: Duplicate address detected
2021-11-30 17:05:35.475 7 ERROR neutron.agent.dhcp.agent

The tap interface inside each affected qdhcp namespace was in a state like this:

35: tap0f8bb343-c1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether fa:16:3e:ed:6f:60 brd ff:ff:ff:ff:ff:ff
    inet 169.254.169.254/32 brd 169.254.169.254 scope global tap0f8bb343-c1
       valid_lft forever preferred_lft forever
    inet 10.18.0.10/16 brd 10.18.255.255 scope global tap0f8bb343-c1
       valid_lft forever preferred_lft forever
    inet6 fe80::a9fe:a9fe/64 scope link dadfailed tentative
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:feed:6f60/64 scope link
       valid_lft forever preferred_lft forever

Note the dadfailed status on the fe80::a9fe:a9fe/64 address, which caused Neutron to raise an AddressNotReady exception.
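
For context, the check the traceback ends in behaves roughly like this (a paraphrased sketch of neutron's ip_lib predicate, not the exact upstream code; _get_address_info is an illustrative helper):

    def is_address_ready(device, address, namespace):
        # Inspect the DAD state of the address on the device.
        info = _get_address_info(device, address, namespace)  # hypothetical helper
        if info.dadfailed:
            # DAD saw another node answer for the same address.
            raise AddressNotReady(
                address=address, reason='Duplicate address detected')
        # Still tentative means DAD has not finished yet; keep polling.
        return not info.tentative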

I tried restarting dhcp-agent multiple times. Occasionally DHCP for one network would configure correctly, but most of the time the list of affected networks stayed the same.

I found that removing the fe80::a9fe:a9fe/64 address from the tap interface of each affected namespace and then restarting dhcp-agent fixed the issue: the dadfailed status did not come back.

Version information:

* OpenStack Victoria deployed with Kolla source images
* neutron 17.2.2.dev70 (using stable/victoria from 2021-11-28)
* CentOS Stream release 8
* Linux kernel 4.18.0-348.2.1.el8_5.x86_64 #1 SMP Tue Nov 16 14:42:35 UTC 2021

Tags: ipv6
Pierre Riteau (priteau)
summary: - DHCP agent fails to configure DHCP namespaces because of duplicate
+ DHCP agent fails to fully configure DHCP namespaces because of duplicate
address detected
Revision history for this message
Brian Haley (brian-haley) wrote :

So this happens only when you have more than one DHCP agent, correct? Using isolated subnets?

It looks like an oversight from when we added support for metadata over IPv6, since using the same link-local address on multiple nodes will fail DAD, as you show above.

Just thinking out loud, there might be a couple of options:

1) Neutron tells only one DHCP agent to configure the IPv6 metadata address. This reduces availability, and there might be some edge cases, but it could work.

2) We change to an anycast address, so that only one of the nodes gets the request. But this is more complicated because 1) anycast addresses are only supposed to be configured on routers (which don't exist here), and 2) IANA assigns anycast addresses: https://www.iana.org/assignments/ipv6-anycast-addresses/ipv6-anycast-addresses.xhtml

A quick fix for you would be to set this in neutron.conf:

dhcp_agents_per_network = 1
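
(For what it's worth, dhcp_agents_per_network is a neutron-server option: it controls how many DHCP agents the scheduler assigns to each network, and so how many nodes end up configuring the shared fe80::a9fe:a9fe address.)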

Changed in neutron:
importance: Undecided → High
status: New → Confirmed
tags: added: ipv6
Revision history for this message
Pierre Riteau (priteau) wrote :

This is actually on a deployment with three OpenStack controllers, each running dhcp-agent, but not in HA mode, so dhcp_agents_per_network is already at the default value of 1.

We also have a non-default setting: enable_isolated_metadata = true.

Most of the networks are tenant networks, but there are a few provider networks with Neutron DHCP enabled.

Revision history for this message
Brian Haley (brian-haley) wrote :

I know Bence is also looking at this, but one more question since I don't have a running devstack at the moment. You have enable_isolated_metadata set to True, but is there a router attached? If yes, does its namespace have this IPv6 address configured as well? It might be a moot point, but it would be good to know. Thanks.

Revision history for this message
Bence Romsics (bence-romsics) wrote :

I suspect this may be a duplicate of:
https://bugs.launchpad.net/neutron/+bug/1930414

Revision history for this message
Bence Romsics (bence-romsics) wrote (last edit):

Brian: If my suspicion is right that this is a duplicate of that other bug, then this is not metadata specific. It is actually not even IPv6 specific; it is just that only IPv6 performs duplicate address detection by default, which catches that DHCP ports may briefly leak traffic while they are plugged into the dead VLAN. This could be confirmed by creating multiple v6 subnets with the exact same range, so that neutron chooses the exact same address for the DHCP ports; those addresses should go to dadfailed just as the metadata address does.
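
Roughly, that confirmation experiment could look like this (an untested sketch using openstacksdk; the network names and ULA range are arbitrary):

    import openstack

    conn = openstack.connect()  # credentials from clouds.yaml or env vars

    # Two separate networks with the exact same IPv6 range: neutron should
    # choose the same address for both DHCP ports, so if traffic leaks while
    # the ports sit on the dead VLAN, DAD should fail for these addresses
    # just as it does for the fixed metadata address.
    for name in ('dad-repro-1', 'dad-repro-2'):
        net = conn.network.create_network(name=name)
        conn.network.create_subnet(
            network_id=net.id, ip_version=6, cidr='fdaa:dead:beef::/64',
            ipv6_ra_mode='dhcpv6-stateful',
            ipv6_address_mode='dhcpv6-stateful')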

But regarding your metadata-related question: only the DHCP namespace should have the metadata address configured. IIRC, we do not configure the metadata address in the router namespace; instead we catch that traffic with an iptables redirect to the loopback address, where haproxy listens.

Revision history for this message
Kamil Madac (kamil-madac) wrote :

We experienced the same bug last week, as I described on the mailing list: http://lists.openstack.org/pipermail/openstack-discuss/2022-January/026484.html. This bug has severe consequences when the dadfailed state goes unnoticed by operators.

When the DHCP agent is restarted and there are DHCP namespaces with interfaces in the dadfailed state, the NetworkCache in the DHCP agent is not updated with the subnets. As a result, a subsequent VM creation, or an update to a VM port on such a network, deletes the namespace completely, which causes a connectivity outage for all VMs on that network.

I think the fix should be that, if an exception is raised in the DHCP agent's configure_dhcp_for_network inside the update_isolated_metadata_proxy() call, self.cache.put(network) is still called in every case, to ensure that the NetworkCache is updated correctly and the DHCP namespace won't be deleted by the next SyncState call.

Here is the code from agent.py which I'm talking about:

    def configure_dhcp_for_network(self, network):
        if not network.admin_state_up:
            return

        for subnet in network.subnets:
            if subnet.enable_dhcp:
                if self.call_driver('enable', network):
                    self.update_isolated_metadata_proxy(network)
                    self.cache.put(network)
                    # After enabling dhcp for network, mark all existing
                    # ports as ready. So that the status of ports which are
                    # created before enabling dhcp can be updated.
                    self.dhcp_ready_ports |= {p.id for p in network.ports}
                break
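
A minimal sketch of the change Kamil describes (illustrative only, not a merged patch): the metadata proxy call is wrapped so that the cache update always happens, even when the proxy setup raises.

    def configure_dhcp_for_network(self, network):
        if not network.admin_state_up:
            return

        for subnet in network.subnets:
            if subnet.enable_dhcp:
                if self.call_driver('enable', network):
                    try:
                        self.update_isolated_metadata_proxy(network)
                    finally:
                        # Always cache the network so the next SyncState
                        # does not treat it as stale and delete the
                        # namespace, even if the proxy setup failed.
                        self.cache.put(network)
                    self.dhcp_ready_ports |= {p.id for p in network.ports}
                break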

Revision history for this message
Mark Goddard (mgoddard) wrote :

I had the same symptoms as this issue, with slightly different steps to reproduce.

On an existing Wallaby deployment using neutron ML2/OVS, create a VLAN network and an IPv4 subnet. The DHCP agent logs show the same backtrace as in the original description, and restarting the DHCP agent produces the same backtrace each time.

While DHCP seems to work, metadata does not.

There are two other networks which do not exhibit this issue.

Worked around the issue as suggested:

ip netns                             # find the qdhcp namespace for the affected network
ip netns exec qdhcp-dee9459f-7ed8-4627-9c42-4006ec09d5fd bash
ip a | grep dadfailed                # confirm which address and interface failed DAD
ip a del fe80::a9fe:a9fe/64 dev tapf0cd099d-aa

FWIW, the system is Ubuntu 20.04 based, and is deployed via kayobe/kolla-ansible.

Revision history for this message
Stig Telfer (stigtelfer) wrote :

I don't think this bug is a duplicate of #1930414, or at least the fix for #1930414 (https://opendev.org/openstack/neutron/commit/9d5cea0e2bb85b3b6ea27eb71279c57c419b0102) does not fix this issue.

I have reproduced the issue on a Wallaby OpenStack deployment which has the backported fix applied.

My workaround is to disable IPv6 on the controller nodes via sysctl (net.ipv6.conf.all.disable_ipv6: 1).
