intermittent dhcp failures

Bug #1748894 reported by Laurent Sesques
18
This bug affects 3 people
Affects Status Importance Assigned to Milestone
neutron
Undecided
Unassigned
neutron (Ubuntu)
Undecided
Unassigned
Xenial
Undecided
Unassigned

Bug Description

Hi,

We have observed instances failing to get a DHCP reply, either when booting for the first time, or after a reboot.
By tcpdumping the traffic on the tap, then qvb and qvo interfaces, we can see the DHCP request leaving, but it doesn't reach the neutron-gateway node where the dhcp agent for the network runs.

If we remove and then re-add the network to the dhcp agent, it starts working fine again.

I have dumped the output of `ovs-ofctl dump-flows br-tun` on the neutron-gateway side, before and after doing the remove/add.
One difference I noted, was a rule appearing, with the mac of the VM which had the issue:
 cookie=0xb42e8d404b989e72, duration=13.344s, table=20, n_packets=0, n_bytes=0, hard_timeout=300, idle_age=13, priority=1,vlan_tci=0x0001/0x0fff,dl_dst=fa:16:3e:db:34:a3 actions=load:0->NXM_OF_VLAN_TCI[],load:0x20->NXM_NX_TUN_ID[],output:5201

However, this is a rule for packets from the dhcp agent to the VM, so its absence doesn't explain why DHCP requests don't arrive to destination.

The cloud is running neutron 8.4.0 (Mitaka), on ubuntu (16.04) machines.

Most of the time it works, but this problem comes back often enough that it impacts users.

Thanks.

Revision history for this message
Brian Haley (brian-haley) wrote :

Can you see if this happens on master? Mitaka is no longer supported upstream.

Changed in neutron:
status: New → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in neutron (Ubuntu Xenial):
status: New → Confirmed
Changed in neutron (Ubuntu):
status: New → Confirmed
Revision history for this message
Salman (salmankh) wrote :

This happens to our cluster as well which is on pike, but somewhat in different way. Since the cluster is HA, instance do able to get the ip from one of the dhcp namespaces but dnsmasq does not work later if the problematic namespace ends up first in resolv.conf. When we delete the specific port from the network which automatically gets created after a few seconds, things start working again.

Revision history for this message
Jack Ivanov (gunph1ld) wrote :

Same problem here, running Queens. DHCP packets are going out, but not reaching the server

Revision history for this message
Kevin Zhao (kevin-zhao) wrote :

Confirmed. OpenStack Rocky.
DHCP packets not reaching the server.

Debian Buster, Kernel 4.19/4.16/

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers