Comment 23 for bug 1707999

Jason Hobbs (jason-hobbs) wrote : Re: pod VM fails to PXE boot after receiving multiple DHCP offers from both primary and secondary rack controllers, for different IPs

Alright, learning a little more about DHCP.

The reason both dhcp servers are responding is this setting:

"load balance max seconds 3"

This means that if the SECS value in the dhcp request header is greater than 3, the secondary server will respond, regardless of load balancing settings.
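
As a rough sketch of that cutoff (a simplified model of the behavior described above, not the actual dhcpd code; the function and parameter names are made up for illustration):

```python
def secondary_should_respond(secs: int,
                             assigned_to_secondary: bool,
                             max_seconds: int = 3) -> bool:
    """Simplified model of the "load balance max seconds" cutoff.

    secs                  -- SECS value from the client's request header
    assigned_to_secondary -- hypothetical stand-in for the normal load
                             balancing decision (the real server derives
                             this itself; here it is just passed in)
    max_seconds           -- the "load balance max seconds" setting
    """
    # Past the cutoff, load balancing is bypassed and the server answers.
    if secs > max_seconds:
        return True
    # Otherwise only answer clients that load balancing assigns to us.
    return assigned_to_secondary


# The two discovers in the capture carry values 4 and 8, both over the cutoff of 3.
for secs in (4, 8):
    print(secs, secondary_should_respond(secs, assigned_to_secondary=False))
```

With the setting at 3, both of those discovers get an answer from the secondary even when load balancing would have left the client to the primary.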

Here's a tcpdump of a dhcp discover/offer sequence captured on the interface the secondary server provides dhcp on:

http://paste.ubuntu.com/25595132/

There are two discover packets received - one with a SECS value of 4 and the next with a SECS value of 8. This is kind of weird because they were captured only about one second apart.
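
For anyone who wants to check the SECS value in a capture by hand: per RFC 2131 it is a 16-bit field in network byte order at byte offset 8 of the BOOTP/DHCP payload (after op, htype, hlen, hops and the 4-byte xid). A minimal sketch, assuming you already have the raw UDP payload as bytes; read_secs is a hypothetical helper name, and the sample packet below is made up, not taken from the paste:

```python
import struct


def read_secs(payload: bytes) -> int:
    """Return the secs field from a raw BOOTP/DHCP payload.

    Header layout (RFC 2131): op(1) htype(1) hlen(1) hops(1) xid(4)
    secs(2) flags(2), all in network byte order.
    """
    if len(payload) < 12:
        raise ValueError("payload too short for a BOOTP header")
    _op, _htype, _hlen, _hops, _xid, secs, _flags = struct.unpack(
        "!BBBBIHH", payload[:12]
    )
    return secs


# Made-up header for illustration: BOOTREQUEST, ethernet, secs = 4.
sample = struct.pack("!BBBBIHH", 1, 1, 6, 0, 0x12345678, 4, 0)
print(read_secs(sample))
```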

It's also weird that we don't see any dhcp requests before that - tcpdump was started well in advance of the node booting.

It also doesn't explain why sometimes the dhcp servers reject the request with a dhcpnak, leading to failed boots, and sometimes they accept it.

A workaround for us may be to increase the value of "load balance max seconds" to something over what we're seeing here - like 10 or 15. It could have negative consequences if the primary dhcp server can't respond for some reason, but the secondary doesn't know it's down, and the client times out its request before getting to 10 or 15 seconds.