Bug #1988069 “neutron-dhcp-agent fails when small tenant network...” : Bugs : neutron

Revision history for this message

Michael Sherman (msherman-uchicago) wrote on 2022-08-29:

#1

A related issue is that it's not currently possible to set min/max allowed MTU for tenant networks, see https://bugs.launchpad.net/neutron/+bug/1859362

Otherwise we could work around this by setting the minimum tenant network MTU above this threshold.

Revision history for this message

Brian Haley (brian-haley) wrote on 2022-08-30:

#2

So I could not reproduce the failure you had in your paste (invalid parameter when trying to configure IP address) when I tried this on the master branch.

DHCP would not work on the instance because the MTU was too small to transmit the request:

Starting network: udhcpc: started, v1.29.3
udhcpc: sending discover
udhcpc: sendto: Message too long
udhcpc: sending discover
udhcpc: sendto: Message too long
udhcpc: sending discover
udhcpc: sendto: Message too long

I'd also assume the response would have never fit in 70 bytes either. This makes me think we need to enforce some MTU minimum since the network/subnet is useless otherwise, but outside that the resolution is to use a larger MTU (or the default).

I will set this as Medium just because I could not trigger any DHCP outage as described above, if you can supply any more information on reproducing that I will revisit.

Changed in neutron:
importance:	Undecided → Medium
status:	New → Incomplete

Revision history for this message

Michael Sherman (msherman-uchicago) wrote on 2022-08-31:

#3

Thank you!

DHCP not working for a network where the tenant has set a small MTU is totally expected. In terms of usability, I would (as a user) expect neutron to refuse to enable DHCP on a subnet if the MTU is too low, rather than it being "enabled", but not working. The network is still "useful" for L2 traffic, or L3 with static IPs.

To replicate the issue affecting other networks, I was able to produce the error when running kernel `5.4.0-120-generic #136-Ubuntu`, but not with kernel `5.4.0-122-generic #138-Ubuntu`, both with the Xena commit mentioned above. I have not tested yet with master.

Revision history for this message

Brian Haley (brian-haley) wrote on 2022-08-31:

#4

Yes, the network is still usable since it could be configured some other way (config drive, manually). You can't run IPv6 on it, and it's right near the 68-byte IPv4 minimum.

The other bug you linked about min/max network MTU would be useful here, although there are cases already where you can overflow the default DHCP response size with things like static route options. In the end the IPv6 minimum of 1280 seems a good place to start.

As a reference I tested this on Ubuntu 20.04 with the 5.15.0-46 kernel. Perhaps there was a kernel bug in some versions that triggered the invalid argument? If that is the case then I'm not sure there's a bug here and we can continue discussion in the linked MTU bug.

Revision history for this message

Michael Sherman (msherman-uchicago) wrote on 2022-09-14:

#5

While I haven't been able to rule out a kernel bug yet, I do feel that there's a more general bug in error handling for neutron-dhcp-agent, in that a failure to configure DHCP in one netns should not impact others.

I have seen this general behavior of a netns-specific failure breaking dhcp globally in two cases
1. This issue, where a kernel version + MTU triggers the failure
2. https://bugs.launchpad.net/neutron/+bug/1953165 , the IPV6 address conflicts from that bug caused dhcp failures even in namespaces without conflicts.

Revision history for this message

Launchpad Janitor (janitor) wrote on 2022-11-14:

#6

[Expired for neutron because there has been no activity for 60 days.]

Changed in neutron:
status:	Incomplete → Expired

Revision history for this message

Zakhar Kirpichenko (kzakhar) wrote on 2023-02-15:

#7

We just hit this issue in a Wallaby deployment (manual package deployment, Ubuntu 20.04, kernel 5.4, linux bridge). Neutron-dhcp-agent fails with the following error:

neutron.privileged.agent.linux.ip_lib.InvalidArgument: Invalid parameter/value used on interface X, namespace qdhcp-Y"

whenever the network MTU is < 1280 bytes. Comparing the interface X of the namespace Y to properly functioning interfaces of other namespaces shows that IPv6 configuration is missing from the interface X.

DHCP agents remain broken globally until the network is removed or adjusted to MTU >= 1280.

Changed in neutron:
status:	Expired → Opinion
status:	Opinion → Confirmed

Revision history for this message

Brian Haley (brian-haley) wrote on 2023-02-15:

#8

Unfortunately we cannot prevent a user from mis-configuring their network, such as setting the MTU below 1280 and adding an IPv6 subnet. The best we can do is add a note/warning to the admin guide with this information and hope the user reads it.

Moving to Low for this reason.

Changed in neutron:
importance:	Medium → Low

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2023-02-15: Fix proposed to neutron (master)

#9

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/874036

Changed in neutron:
status:	Confirmed → In Progress

Brian Haley (brian-haley) on 2023-02-15

Changed in neutron:
assignee:	nobody → Brian Haley (brian-haley)

Revision history for this message

Michael Sherman (msherman-uchicago) wrote on 2023-02-15:

#10

Isn't it still a large impact if a single user can, either in error, or maliciously, break DHCP for all subnets sharing a DHCP agent?

Last time this came up for us, I was able to work around it as only some kernel versions (see up-thread) are affected, but it would be great if the "blast radius" of a broken DHCP process could be limited to the misconfigured network/namespace.

Revision history for this message

Brian Haley (brian-haley) wrote on 2023-02-16:

#11

As I see it there are two issues here.

1) A broken kernel. Neutron can't do anything about that, it would be up to the admin to install one with a fix and reboot. There have been issues before with bad kernels, all we can do is highlight they're broken so others don't use them.

2) A mis-configured network. This is user error, and since it's something (presumably) owned by a single project, any "blast radius" only involves a single user. Something like a provider network would be owned by an admin, and a normal user would not have rights to add a subnet to it and cause an issue. The best we can do sometimes is warn not to do something, but we can't change the API to not allow a small MTU.

Unless I'm missing something?

Revision history for this message

Zakhar Kirpichenko (kzakhar) wrote on 2023-02-16 (last edit on 2023-02-16):

#12

I'd like to clarify several things:

1) I cannot confirm a "broken kernel" suggestion. I tried with several 5.4 kernels from Ubuntu 20.04, including the versions stated earlier as well as older and newer versions, and they all behaved exactly the same. Thus far it seems to me that an interface with MTU<1280 cannot get IPv6 configuration because IPv6 requires MTU>=1280 by design. I'm not sure about this, but it may be an expected kernel behavior. I will try kernel 5.15 today and see whether anything changes.

2) I am not a Python developer or a Neutron developer in particular, but I managed to find the relevant part in ml2 plugin code and prevent it from creating networks with MTU<1280 similarly to how it prevents creating networks with MTU>MAX_MTU. It was a rather ugly hack with a hardcoded minimum value though, so I'm wondering whether there's a better way.

3) The "blast radius" is not limited to a single user/tenant or a single project, but is system-wide. A non-admin user, for example a member of a tenant, is able to cause a denial of service by creating an internal DHCP-enabled network with MTU<1280. In our deployment with 3 infra nodes and redundant DHCP agents, within a few seconds of creating such network neutron-dhcp-agent instances fail on all infra nodes, enter an error-loop and are unable to process configuration changes. Everything that relies on DHCP configuration changes, including attachments of new ports in/to other DHCP-enabled networks of all other tenants, including the service networks of a service tenant such as for example LBaaS management, stops working in this scenario until the user's network is removed or adjusted to have MTU>1280. This seems like a rather high-priority issue to me.

Revision history for this message

Michael Sherman (msherman-uchicago) wrote on 2023-02-16:

#13

Hi Brian,

Thanks for breaking it down into issues 1 and 2, I think I can respond a bit more clearly.
I agree that:
For 1, if the issue is only a broken kernel, that has a clear workaround and seems to be a non-issue, as mentioned.
For 2, in the case where the system is not impacted by (1), the blast radius is indeed restricted to a single network, and is again a non-issue since it's self-inflicted.

However, my issue lies in a perceived lack of "robustness" or error handling in the dhcp-agent, as I seem to have observed the following cases where some kind of error propagated outside the boundaries of a single network:
1. My original issue in this thread, setting a MTU below 280 + having ipv4 enabled, + kernel `5.4.0-120-generic #136-Ubuntu`, caused an error loop preventing dhcp agents from updating.
2. Zakhar's issue, with MTU below 1280 + IPV6 dhcp, with same symptoms
3. Issue https://bugs.launchpad.net/neutron/+bug/1953165, which although a different cause, has the same failure mode with the DHCP agent no longer processing updates.

My systems have thus far only been running a single networking node + dhcp agent, but Zakhar reports the issue propagating across multiple infrastructure nodes?

To me it seems that if more robust error handling could wrap the interaction between neutron-dhcp-agent, and this category of system error, it would reduce the severity of the above-mentioned failure cases. This is admittedly a naive proposition, and maybe that's impractical!

Thanks again for your attention on this.

Revision history for this message

Zakhar Kirpichenko (kzakhar) wrote on 2023-02-16:

#14

Actually my issue is not caused by IPv6 DHCP, the offending network is IPv4 only. DHCP agents start dnsmasq with IPv4 options only, but when MTU is lower than 1280 whatever the agents do - before starting dnsmasq for that network - fails, and the agents enter an error-loop.

The reason why all DHCP agents are affected in our deployment is that we run the agents in HA mode (dhcp_agents_per_network=3) on 3 infra nodes, i.e. all agents run the same configuration and fail the same way.

Revision history for this message

Brian Haley (brian-haley) wrote on 2023-02-16:

#15

Hi,

Thanks for the comments.

So I guess I always thought it was either a kernel issue, or an issue spawning dnsmasq for an individual network, which would only impact a single network. After I saw Zakhar's comment this morning I did this on a local setup and see the agent goes into a loop, which is worse than I expected.

I will take a closer look at the code and see how we can mitigate this. Unfortunately just requiring a 1280+ network MTU will break our API agreement with users. It could be the agent can detect and ignore this case, but then we're in that place where the user doesn't exactly know they mis-configured things.

Changed in neutron:
importance:	Low → High

Revision history for this message

Brian Haley (brian-haley) wrote on 2023-02-16:

#16

Zakhar - can you provide the stack trace for that failure? And output of 'openstack network show...' and 'openstack subnet show...' for the involved pieces? Without an IPv6 subnet my dhcp-agent starts right up.

Revision history for this message

Zakhar Kirpichenko (kzakhar) wrote on 2023-02-16 (last edit on 2023-02-16):

#17

Download full text (10.0 KiB)

Brian,

Many thanks for your response. Here's the information you requested:

neutron-dhcp-agent stack trace:

2023-02-16 19:11:47.896 49507 ERROR neutron.agent.dhcp.agent [-] Unable to enable dhcp for aee895a8-acda-420c-9c58-6a936c9a4102.: neutron.privileged.agent.linux.ip_lib.InvalidArgument: Invalid parameter/value used on interface ns-beed7d37-b4, namespace qdhcp-aee895a8-acda-420c-9c58-6a936c9a4102.
2023-02-16 19:11:47.896 49507 ERROR neutron.agent.dhcp.agent Traceback (most recent call last):
2023-02-16 19:11:47.896 49507 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3/dist-packages/neutron/agent/dhcp/agent.py", line 227, in call_driver
2023-02-16 19:11:47.896 49507 ERROR neutron.agent.dhcp.agent rv = getattr(driver, action)(**action_kwargs)
2023-02-16 19:11:47.896 49507 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3/dist-packages/neutron/agent/linux/dhcp.py", line 266, in enable
2023-02-16 19:11:47.896 49507 ERROR neutron.agent.dhcp.agent common_utils.wait_until_true(self._enable, timeout=300)
2023-02-16 19:11:47.896 49507 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3/dist-packages/neutron/common/utils.py", line 708, in wait_until_true
2023-02-16 19:11:47.896 49507 ERROR neutron.agent.dhcp.agent while not predicate():
2023-02-16 19:11:47.896 49507 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3/dist-packages/neutron/agent/linux/dhcp.py", line 278, in _enable
2023-02-16 19:11:47.896 49507 ERROR neutron.agent.dhcp.agent interface_name = self.device_manager.setup(self.network)
2023-02-16 19:11:47.896 49507 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3/dist-packages/neutron/agent/linux/dhcp.py", line 1770, in setup
2023-02-16 19:11:47.896 49507 ERROR neutron.agent.dhcp.agent self.driver.init_l3(interface_name, ip_cidrs,
2023-02-16 19:11:47.896 49507 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3/dist-packages/neutron/agent/linux/interface.py", line 153, in init_l3
2023-02-16 19:11:47.896 49507 ERROR neutron.agent.dhcp.agent device.addr.add(ip_cidr)
2023-02-16 19:11:47.896 49507 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3/dist-packages/neutron/agent/linux/ip_lib.py", line 536, in add
2023-02-16 19:11:47.896 49507 ERROR neutron.agent.dhcp.agent add_ip_address(cidr, self.name, self._parent.namespace, scope,
2023-02-16 19:11:47.896 49507 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3/dist-packages/neutron/agent/linux/ip_lib.py", line 821, in add_ip_address
2023-02-16 19:11:47.896 49507 ERROR neutron.agent.dhcp.agent privileged.add_ip_address(
2023-02-16 19:11:47.896 49507 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3/dist-packages/oslo_privsep/priv_context.py", line 247, in _wrap
2023-02-16 19:11:47.896 49507 ERROR neutron.agent.dhcp.agent return self.channel.remote_call(name, args, kwargs)
2023-02-16 19:11:47.896 49507 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3/dist-packages/oslo_privsep/daemon.py", line 224, in remote_call
2023-02-16 19:11:47.896 49507 ERROR neutron.agent.dhcp.agent raise exc_type(*result[2])
2023-02-16 19:11:47.896 49507 ERROR neutron.agent.dhcp.agent neutron.privileged.agent.linux.ip_l...

Brian,

Many thanks for your response. Here's the information you requested:

neutron-dhcp-agent stack trace:

2023-02-16 19:11:47.896 49507 ERROR neutron.agent.dhcp.agent [-] Unable to enable dhcp for aee895a8-acda-420c-9c58-6a936c9a4102.: neutron.privileged.agent.linux.ip_lib.InvalidArgument: Invalid parameter/value used on interface ns-beed7d37-b4, namespace qdhcp-aee895a8-acda-420c-9c58-6a936c9a4102.
2023-02-16 19:11:47.896 49507 ERROR neutron.agent.dhcp.agent Traceback (most recent call last):
2023-02-16 19:11:47.896 49507 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3/dist-packages/neutron/agent/dhcp/agent.py", line 227, in call_driver
2023-02-16 19:11:47.896 49507 ERROR neutron.agent.dhcp.agent     rv = getattr(driver, action)(**action_kwargs)
2023-02-16 19:11:47.896 49507 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3/dist-packages/neutron/agent/linux/dhcp.py", line 266, in enable
2023-02-16 19:11:47.896 49507 ERROR neutron.agent.dhcp.agent     common_utils.wait_until_true(self._enable, timeout=300)
2023-02-16 19:11:47.896 49507 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3/dist-packages/neutron/common/utils.py", line 708, in wait_until_true
2023-02-16 19:11:47.896 49507 ERROR neutron.agent.dhcp.agent     while not predicate():
2023-02-16 19:11:47.896 49507 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3/dist-packages/neutron/agent/linux/dhcp.py", line 278, in _enable
2023-02-16 19:11:47.896 49507 ERROR neutron.agent.dhcp.agent     interface_name = self.device_manager.setup(self.network)
2023-02-16 19:11:47.896 49507 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3/dist-packages/neutron/agent/linux/dhcp.py", line 1770, in setup
2023-02-16 19:11:47.896 49507 ERROR neutron.agent.dhcp.agent     self.driver.init_l3(interface_name, ip_cidrs,
2023-02-16 19:11:47.896 49507 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3/dist-packages/neutron/agent/linux/interface.py", line 153, in init_l3
2023-02-16 19:11:47.896 49507 ERROR neutron.agent.dhcp.agent     device.addr.add(ip_cidr)
2023-02-16 19:11:47.896 49507 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3/dist-packages/neutron/agent/linux/ip_lib.py", line 536, in add
2023-02-16 19:11:47.896 49507 ERROR neutron.agent.dhcp.agent     add_ip_address(cidr, self.name, self._parent.namespace, scope,
2023-02-16 19:11:47.896 49507 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3/dist-packages/neutron/agent/linux/ip_lib.py", line 821, in add_ip_address
2023-02-16 19:11:47.896 49507 ERROR neutron.agent.dhcp.agent     privileged.add_ip_address(
2023-02-16 19:11:47.896 49507 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3/dist-packages/oslo_privsep/priv_context.py", line 247, in _wrap
2023-02-16 19:11:47.896 49507 ERROR neutron.agent.dhcp.agent     return self.channel.remote_call(name, args, kwargs)
2023-02-16 19:11:47.896 49507 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3/dist-packages/oslo_privsep/daemon.py", line 224, in remote_call
2023-02-16 19:11:47.896 49507 ERROR neutron.agent.dhcp.agent     raise exc_type(*result[2])
2023-02-16 19:11:47.896 49507 ERROR neutron.agent.dhcp.agent neutron.privileged.agent.linux.ip_lib.InvalidArgument: Invalid parameter/value used on interface ns-beed7d37-b4, namespace qdhcp-aee895a8-acda-420c-9c58-6a936c9a4102.
2023-02-16 19:11:47.896 49507 ERROR neutron.agent.dhcp.agent
2023-02-16 19:11:47.897 49507 INFO neutron.agent.dhcp.agent [-] Finished network aee895a8-acda-420c-9c58-6a936c9a4102 dhcp configuration
2023-02-16 19:11:47.897 49507 INFO neutron.agent.dhcp.agent [req-50988b6b-7d62-40e8-bceb-8cb9d0f5fbdc - - - - -] Synchronizing state complete
2023-02-16 19:11:47.901 49507 INFO neutron.agent.dhcp.agent [req-50988b6b-7d62-40e8-bceb-8cb9d0f5fbdc - - - - -] Synchronizing state

Network and subnet info:

I also noticed that during the interface creation neutron-linuxbridge-agent fails to disable IPv6 for this network's interface because the interface doesn't have IPv6 due to low MTU. Unlike neutron-dhcp-agent, neutron-linuxbridge-agent doesn't stop functioning after this:

2023-02-16 19:11:22.397 6780 INFO neutron.agent.securitygroups_rpc [req-9e33893e-4797-4714-a4c6-dc870ef9ae87 - - - - -] Preparing filters for devices {'tapbeed7d37-b4'}
2023-02-16 19:11:22.772 6780 INFO neutron.plugins.ml2.drivers.agent._common_agent [req-9e33893e-4797-4714-a4c6-dc870ef9ae87 - - - - -] Port tapbeed7d37-b4 updated. Details: {'device': 'tapbeed7d37-b4', 'network_id': 'aee895a8-acda-420c-9c58-6a936c9a4102', 'port_id': 'beed7d37-b486-476f-b79a-60640b04907e', 'mac_address': 'fa:16:3e:f0:19:1c', 'admin_state_up': True, 'network_type': 'vxlan', 'segmentation_id': 4224, 'physical_network': None, 'mtu': 1279, 'fixed_ips': [{'subnet_id': 'dd919723-f118-49c3-a1af-3c8d12ea2330', 'ip_address': '10.10.10.4'}], 'device_owner': 'network:dhcp', 'allowed_address_pairs': [], 'port_security_enabled': False, 'qos_policy_id': None, 'network_qos_policy_id': None, 'profile': {}, 'propagate_uplink_status': False}
2023-02-16 19:11:22.789 6780 INFO neutron.plugins.ml2.drivers.linuxbridge.agent.arp_protect [req-9e33893e-4797-4714-a4c6-dc870ef9ae87 - - - - -] Skipping ARP spoofing rules for port 'tapbeed7d37-b4' because it has port security disabled
2023-02-16 19:11:22.827 6780 ERROR neutron.agent.linux.utils [req-9e33893e-4797-4714-a4c6-dc870ef9ae87 - - - - -] Exit code: 255; Cmd: ['sysctl', '-w', 'net.ipv6.conf.vxlan-4224.disable_ipv6=1']; Stdin: ; Stdout: ; Stderr: sysctl: cannot stat /proc/sys/net/ipv6/conf/vxlan-4224/disable_ipv6: No such file or directory
2023-02-16 19:11:22.828 6780 WARNING neutron.agent.linux.ip_lib [req-9e33893e-4797-4714-a4c6-dc870ef9ae87 - - - - -] Setting ['sysctl', '-w', 'net.ipv6.conf.vxlan-4224.disable_ipv6=1'] in namespace None failed: Exit code: 255; Cmd: ['sysctl', '-w', 'net.ipv6.conf.vxlan-4224.disable_ipv6=1']; Stdin: ; Stdout: ; Stderr: sysctl: cannot stat /proc/sys/net/ipv6/conf/vxlan-4224/disable_ipv6: No such file or directory
.: neutron_lib.exceptions.ProcessExecutionError: Exit code: 255; Cmd: ['sysctl', '-w', 'net.ipv6.conf.vxlan-4224.disable_ipv6=1']; Stdin: ; Stdout: ; Stderr: sysctl: cannot stat /proc/sys/net/ipv6/conf/vxlan-4224/disable_ipv6: No such file or directory

Pastebin of the above: https://pastebin.com/2LFVktiX

I hope this helps.

Revision history for this message

Brian Haley (brian-haley) wrote on 2023-02-16:

#18

The disable_ipv6 sysctl is interesting, but that is all caught so doesn't trigger the agent to loop luckily, so I don't think we need to deal with it.

One of the only reasons I can think a v4-only network fails for you is that you have either force_metadata or enable_isolated_metadata set to True in your config, since that would try and add the IPv6 metadata address and fail in this way. Unfortunately it doesn't log what it was trying to add when it fails so I can't tell. If you set the mtu to 1280 what addresses does it show on the interface?

I'll post a hack that seems to work for me, but it will need more work and discussion with others to see if it's a good plan. Still don't think tweaking the API call(s) to return a HttpConflict code is a good idea though.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2023-02-16:

#19

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/874167

Revision history for this message

Zakhar Kirpichenko (kzakhar) wrote on 2023-02-17:

#20

Download full text (3.9 KiB)

Brian,

The DHCP agent is configured as follows:

# grep -Ev "^#|^$" /etc/neutron/dhcp_agent.ini
[DEFAULT]
interface_driver = linuxbridge
dhcp_driver = neutron.agent.linux.dhcp.Dnsmasq
enable_isolated_metadata = true
[agent]
availability_zone = openstack-network
[ovs]

I.e. enable_isolated_metadata is enabled (recommended setting: https://docs.openstack.org/neutron/wallaby/install/controller-install-option1-ubuntu.html), force_metadata is disabled (default setting). If we disable isolated metadata, what is going to break?

A low-MTU network has just an IPv4 address and no IPv6 link-local addresses, IPv6 is disabled:

# ip netns exec qdhcp-c9b96063-3f8f-41b2-9783-ac73d37894b0 ip a li
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ns-bbd05a6d-0b@if675: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1279 qdisc noqueue state UP group default qlen 1000
    link/ether fa:16:3e:3d:ee:c4 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.10.10.4/24 brd 10.10.10.255 scope global ns-bbd05a6d-0b
       valid_lft forever preferred_lft forever

# ls /proc/sys/net/ipv6/conf/vxlan-3947/
ls: cannot access '/proc/sys/net/ipv6/conf/vxlan-3947/': No such file or directory

"Healthy" networks have IPv6 link-local addresses and IPv6 is enabled:

# ip netns exec qdhcp-8abc13db-565b-4640-9507-819d6ef520ef ip a li
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ns-fc876933-90@if90: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether fa:16:3e:36:d5:40 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.10.10.2/24 brd 10.10.10.255 scope global ns-fc876933-90
       valid_lft forever preferred_lft forever
    inet 169.254.169.254/32 brd 169.254.169.254 scope global ns-fc876933-90
       valid_lft forever preferred_lft forever
    inet6 fe80::a9fe:a9fe/64 scope link
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe36:d540/64 scope link
       valid_lft forever preferred_lft forever

# ls /proc/sys/net/ipv6/conf/vxlan-4952/
accept_dad accept_ra_rtr_pref drop_unsolicited_na max_desync_factor router_probe_interval temp_prefered_lft
accept_ra accept_redirects enhanced_dad mc_forwarding router_solicitation_delay temp_valid_lft
accept_ra_defrtr accept_source_route force_mld_version mldv1_unsolicited_report_interval router_solicitation_interval use_oif_addrs_only
accept_ra_from_local addr_gen_mode force_tllao mldv2_unsolicited_report_interval router_solicitation_max_interval u...

Brian,

The DHCP agent is configured as follows:

# grep -Ev "^#|^$" /etc/neutron/dhcp_agent.ini
[DEFAULT]
interface_driver = linuxbridge
dhcp_driver = neutron.agent.linux.dhcp.Dnsmasq
enable_isolated_metadata = true
[agent]
availability_zone = openstack-network
[ovs]

I.e. enable_isolated_metadata is enabled (recommended setting: https://docs.openstack.org/neutron/wallaby/install/controller-install-option1-ubuntu.html), force_metadata is disabled (default setting). If we disable isolated metadata, what is going to break?

A low-MTU network has just an IPv4 address and no IPv6 link-local addresses, IPv6 is disabled:

# ip netns exec qdhcp-c9b96063-3f8f-41b2-9783-ac73d37894b0 ip a li
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ns-bbd05a6d-0b@if675: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1279 qdisc noqueue state UP group default qlen 1000
    link/ether fa:16:3e:3d:ee:c4 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.10.10.4/24 brd 10.10.10.255 scope global ns-bbd05a6d-0b
       valid_lft forever preferred_lft forever

# ls /proc/sys/net/ipv6/conf/vxlan-3947/
ls: cannot access '/proc/sys/net/ipv6/conf/vxlan-3947/': No such file or directory

"Healthy" networks have IPv6 link-local addresses and IPv6 is enabled:

# ip netns exec qdhcp-8abc13db-565b-4640-9507-819d6ef520ef ip a li
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ns-fc876933-90@if90: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether fa:16:3e:36:d5:40 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.10.10.2/24 brd 10.10.10.255 scope global ns-fc876933-90
       valid_lft forever preferred_lft forever
    inet 169.254.169.254/32 brd 169.254.169.254 scope global ns-fc876933-90
       valid_lft forever preferred_lft forever
    inet6 fe80::a9fe:a9fe/64 scope link
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe36:d540/64 scope link
       valid_lft forever preferred_lft forever

# ls /proc/sys/net/ipv6/conf/vxlan-4952/
accept_dad                  accept_ra_rtr_pref            drop_unsolicited_na          max_desync_factor                  router_probe_interval             temp_prefered_lft
accept_ra                   accept_redirects              enhanced_dad                 mc_forwarding                      router_solicitation_delay         temp_valid_lft
accept_ra_defrtr            accept_source_route           force_mld_version            mldv1_unsolicited_report_interval  router_solicitation_interval      use_oif_addrs_only
accept_ra_from_local        addr_gen_mode                 force_tllao                  mldv2_unsolicited_report_interval  router_solicitation_max_interval  use_tempaddr
accept_ra_min_hop_limit     autoconf                      forwarding                   mtu                                router_solicitations
accept_ra_mtu               dad_transmits                 hop_limit                    ndisc_notify                       seg6_enabled
accept_ra_pinfo             disable_ipv6                  ignore_routes_with_linkdown  ndisc_tclass                       seg6_require_hmac
accept_ra_rt_info_max_plen  disable_policy                keep_addr_on_down            proxy_ndp                          stable_secret
accept_ra_rt_info_min_plen  drop_unicast_in_l2_multicast  max_addresses                regen_max_retry                    suppress_frag_ndisc

Pastebin of the above: https://pastebin.com/ihXedc0U

Revision history for this message

Brian Haley (brian-haley) wrote on 2023-02-17:

#21

Thanks for the info, and it confirms the trigger for you is "enable_isolated_metadata = true" since it will add the IPv6 metadata address (fe80::a9fe:a9fe) to be configured, which will cause the failure and loop. My hack would fix that as well so let me have a discussion with other maintainers on it.

Revision history for this message

Zakhar Kirpichenko (kzakhar) wrote on 2023-02-17:

#22

Many thanks for looking into this, Brian. May I ask what the chances are for this hack/fix to make it into Wallaby?

Revision history for this message

Brian Haley (brian-haley) wrote on 2023-02-17:

#23

Since it looks like Day 1 bug we will backport a fix.

Just adding a note here that I had a discussion in our IRC channel and we decided to do the following:

1) Change the code that adds IP(v6) addresses to look at the MTU and fail more gracefully. For both IPv4 and IPv6.

2) Detect the MTU is invalid in API calls, for example when adding an IPv6 subnet to a network, or changing the MTU on a network, and return an error (409 or 400?).

3) Document to "not do this" :)

I've already started 1 and 3, will see how far I get doing updates today.

Revision history for this message

Zakhar Kirpichenko (kzakhar) wrote on 2023-02-17:

#24

Thanks again, I very much appreciate your effort!

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2023-03-01:

#25

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/875809

Brian Haley (brian-haley) on 2023-04-13

tags:

added: antelope-backport-potential

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2023-04-24: Change abandoned on neutron (master)

#26

Change abandoned by "Brian Haley <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/874036
Reason: Superceded by https://review.opendev.org/c/openstack/neutron/+/875809

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2023-04-24:

#27

Change abandoned by "Brian Haley <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/874167
Reason: Superceded by https://review.opendev.org/c/openstack/neutron/+/875809

Revision history for this message

Zakhar Kirpichenko (kzakhar) wrote on 2023-04-25:

#28

Hi again! What's going on with this issue?

Revision history for this message

Rodolfo Alonso (rodolfo-alonso-hernandez) wrote on 2023-04-25:

#29

Hello Zakhar:

You can check in gerrit the status of the patches. The main patch https://review.opendev.org/c/openstack/neutron/+/875809 has been merged.

Regards.

Revision history for this message

Zakhar Kirpichenko (kzakhar) wrote on 2023-04-25:

#30

Thanks!

Revision history for this message

Gregory Orange (gregoryo2017) wrote on 2023-04-25:

#31

I am seeing a discrepancy between the problem stated in the Bug Description, and the patch merged. I will heavily simplify below, referring only to IPv4.

Problem stated: e.g. Network created with 70 MTU causes DHCP to fail on other networks with higher MTU.

Patch merged: MTU is required to be at least 68, else an error.

One can hopefully see why the patch as I've described it would not fix the problem as I've stated it. I assume that I am missing something. Would someone be willing to explain this to me?

Thank you,
Greg.

Revision history for this message

Rodolfo Alonso (rodolfo-alonso-hernandez) wrote on 2023-04-25:

#32

The patch fixing this issue makes a change in the API (a limitation). We can't backport this patch because of this.

That was discussed during the Neutron team meeting today [1].

[1]https://meetings.opendev.org/meetings/networking/2023/networking.2023-04-25-14.01.log.html#l-226

tags:

removed: antelope-backport-potential

Revision history for this message

Brian Haley (brian-haley) wrote on 2023-04-25:

#33

Greg - if you look at my first patch, https://review.opendev.org/c/openstack/neutron/+/874167 - you will see it was pretty invasive. I felt it was better to prevent an invalid MTU in the API call(s) instead.

I tested this with a network MTU down to 68 for IPv4, 1280 for IPv6, and didn't see any problems, so 70 should work fine. If you have a chance to try the change please let me know if there is still an issue and I will address it.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2023-05-04: Fix merged to neutron (master)

#34

Reviewed: https://review.opendev.org/c/openstack/neutron/+/875809
Committed: https://opendev.org/openstack/neutron/commit/88ce859b568248a0ee2f47a5d91c1708b774d20e
Submitter: "Zuul (22348)"
Branch: master

commit 88ce859b568248a0ee2f47a5d91c1708b774d20e
Author: Brian Haley <email address hidden>
Date: Wed Mar 1 00:52:38 2023 -0500

Change API to validate network MTU minimums

    A network's MTU is now only valid if it is the minimum value
    allowed based on the IP version of the associated subnets,
    68 for IPv4 and 1280 for IPv6.

This minimum is now enforced in the following ways:

    1) When a subnet is associated with a network, validate
       the MTU is large enough for the IP version. Not only
       would the subnet be unusable if it was allowed, but the
       Linux kernel can fail adding addresses and configuring
       network settings like the MTU.

    2) When a network MTU is changed, validate the MTU is large
       enough for any currently associated subnets. Allowing a
       smaller MTU would render any existing subnets unusable.

Closes-bug: #1988069
Change-Id: Ia4017a8737f9a7c63945df546c8a7243b2673ceb

Changed in neutron:
status:	In Progress → Fix Released

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2023-05-17: Fix included in openstack/neutron 23.0.0.0b2

#35

This issue was fixed in the openstack/neutron 23.0.0.0b2 development milestone.

Revision history for this message

Andy Gomez (agomerz) wrote on 2023-09-11:

#36

Download full text (8.9 KiB)

I have just run into this issue running Wallaby though the MTU on a network was set to 128.

This prevented any new ports on other networks sharing this DHCP agent from being put into ACTIVE status.
The ports would be stuck in BUILD status. Until The MTU of the offending network was increased.

openstack port list --network 58dc3b69-2c46-4f6b-ae03-a7de7aeb709b
+--------------------------------------+----------+-------------------+----------------------------------------------------------------------------+--------+
| ID | Name | MAC Address | Fixed IP Addresses | Status |
+--------------------------------------+----------+-------------------+----------------------------------------------------------------------------+--------+
| 20f950e2-7ce3-402e-a209-27f9b25dd7f3 | | fa:16:3e:fc:76:74 | ip_address='10.32.0...

neutron

neutron-dhcp-agent fails when small tenant network mtu is set

Bug Description

Other bug subscribers

Remote bug watches