[SRU] iPXE doesn't handle NAK requests when multiple DHCP server's offer

Bug #1707999 reported by Jason Hobbs
12
This bug affects 1 person
Affects Status Importance Assigned to Milestone
MAAS
Invalid
Critical
Blake Rouse
2.2
Invalid
Critical
Unassigned
ipxe (Ubuntu)
Fix Released
Critical
Christian Ehrhardt 
Xenial
Fix Released
Critical
Unassigned
Zesty
Invalid
Critical
Unassigned
Artful
Fix Released
Undecided
Unassigned
Bionic
Fix Released
Undecided
Unassigned

Bug Description

[Impact]
When there are multiple DHCP servers on the network, iPXE doesn't handle NAK's for the DHCP servers. This causes iPXE to get blocked without attempting to re-discover, hence, never obtaining an IP address.

For example, in a MAAS HA environment with a DHCP master/slave configuration, the machine fails to PXE boot because at a certain point, the DHCP server is not fully in sync, which causes iPXE to get a NAK request. This prevents the machine from PXE booting.

[Test case]
The easiest way:
1. Install MAAS with two rack controllers
2. Configure HA
3. PXE boot KVM's.

[Regression Potential]
Minimal. This only ensures that iPXE attempts to re-discover the network when it receives a NACK.

[Original bug report]
A VM failed to PXE boot after receiving multiple DHCP offers.

You can see this here on a log from the secondary controller:
http://paste.ubuntu.com/25221939/

The node is offered both 10.245.208.201 and 10.245.208.120, tries to get 10.245.208.120, and is refused.

One strange thing is that it seems like the DHCP server on both the primary controller and the secondary controller are responding. The primary controller's log doesn't have the offer for 10.245.208.120 - only the offer for 10.245.208.201:
http://paste.ubuntu.com/25221952/

This is in an HA setup: region API's are at 10.245.208.30, 10.245.208.31 and 10.245.208.32. We're using hacluster to load balance, and a VIP in front at 10.245.208.33. There are rack controllers on 10.245.208.30 and 10.245.208.31. For the untagged vlan this VM is trying to boot from, 10.245.208.30 is set as the primary controller, and 10.245.208.31 is set as the secondary.

Primary postgres is on 10.245.208.30, it's being replicated to backup postgres on 10.245.208.31. It has a VIP at 10.245.208.34.

We don't hit this everytime - on this deployment only one machine out of about 30 hit this.

We've also seen this on single node MAAS setups - non HA. So, it's not an HA specific issue.

I've attached logs from the maas servers.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :
Revision history for this message
Chris Gregan (cgregan) wrote :

We need to investigate network setting (MTU, Switch config)

Changed in maas:
status: New → Incomplete
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

What network/mtu settings/switch config do we need to investigate? What is the theory there?

Changed in maas:
status: Incomplete → New
tags: added: foundations-engine
removed: foundation-engine
Changed in maas:
milestone: none → 2.3.0
Revision history for this message
Andres Rodriguez (andreserl) wrote :

Hi Jason,

I remember issues with KVM in the past that based on the way the bridge was configured, it would cause some VM's to not boot from MAAS. I can't fully recall what these were and cannot find the bug reported at the time, but it could be related to the stp config on the bridge maybe?

Changed in maas:
status: New → Incomplete
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

No, stp is not enabled on the bridge.

Changed in maas:
status: Incomplete → New
Revision history for this message
Andres Rodriguez (andreserl) wrote :

Right, but I dont recall if that was the exact fix. So, to figure out what is wrong, can you please get the console log for all the machines that are failing to boot so we can have an idea of what's wrong?

Without the console log it will be difficult for us to determine what may be wrong.

Changed in maas:
status: New → Incomplete
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

You can see clearly from the logs I've posted that it's receiving two dhcp offers, trying to get one, getting NAK'd, and giving up. What would the console logs help with there?

Changed in maas:
status: Incomplete → New
Revision history for this message
Chris Gregan (cgregan) wrote :

@jhobbes
The suspicion is that this might be caused by HA. Additional environment information and contextual information is needed to properly triage this bug.

1) Does it happen in non-HA mode
2) What changes have been made to the bridging
3) Could the issue be releated to: https://bugs.launchpad.net/ubuntu/+source/maas/+bug/1246236/comments/32

Revision history for this message
Andres Rodriguez (andreserl) wrote :

Alright, so after discussion in the cross team, I wanted to record what was discussed:

1. You are using an HA MAAS.
2. You are manually modifying the VM's created by MAAS to use a bridge, which is not part of the Pod support in MAAS.

Since you are doing (2), which is non-supported for MAAS pods, it is really difficult for us to determine what the problem is. That said, since you are doing (1), what we would like information from you is:

1. Disable HA, only use single region/rack controller and try to reproduce the issue.

If the issue is reproducible, then we can safely assume it is an issue with the custom bridge you are creating. If the issue is not reproducible, then it could be related to HA or it could a bug in ISC-DHCP.

Also, please see https://bugs.launchpad.net/ubuntu/+source/maas/+bug/1246236/comments/32 for issues with bridges and KVM's.

Changed in maas:
status: New → Incomplete
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

We haven't hit this anytime recently. I don't know how to reproduce this regularly.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

We hit this again last night. This is an HA only issue, it's very clear from the logs - it has something to do with the second DHCP server handing out a lease for some reason.

52:54:00:50:fd:5b is the MAC address of the KVM node that failed to PXE boot

Here is syslog from the primary rack controller:

http://paste.ubuntu.com/25579251/

You can see that node is getting an IP from the secondary rack server:

syslog:Sep 20 07:10:17 infra1 dhcpd[24849]: DHCPDISCOVER from 52:54:00:50:fd:5b via broam: load balance to peer failover-vlan-5022
syslog:Sep 20 07:10:18 infra1 dhcpd[24849]: DHCPREQUEST for 10.245.208.170 (10.245.208.31) from 52:54:00:50:fd:5b via broam: lease owned by peer
syslog:Sep 20 07:10:18 infra1 dhcpd[24849]: DHCPREQUEST for 10.245.208.170 (10.245.208.31) from 52:54:00:50:fd:5b via broam: lease owned by peer
syslog:Sep 20 07:10:19 infra1 maas.interface: [info] eno2 (physical) on infra1: New MAC, IP binding observed: 52:54:00:50:fd:5b, 10.245.208.170

Later on it tries to request that IP again and the primary rack server offers it a different IP. It, however, wants the IP the secondary rack server is offering:
syslog:Sep 20 07:53:20 infra1 dhcpd[24849]: DHCPDISCOVER from 52:54:00:50:fd:5b via broam
syslog:Sep 20 07:53:20 infra1 dhcpd[24849]: DHCPOFFER on 10.245.208.222 to 52:54:00:50:fd:5b via broam
syslog:Sep 20 07:53:21 infra1 dhcpd[24849]: DHCPDISCOVER from 52:54:00:50:fd:5b via broam
syslog:Sep 20 07:53:23 infra1 dhcpd[24849]: DHCPREQUEST for 10.245.208.170 (10.245.208.31) from 52:54:00:50:fd:5b via broam: lease 10.245.208.170 unavailable.

Here's syslog from the secondary rack controller, showing it offering the 10.245.208.170 address at the same time the primary rack controller is offering the 10.245.208.22 address:

http://paste.ubuntu.com/25579275/

The two offers:
<secondary> Sep 20 07:53:21 infra2 dhcpd[20205]: DHCPOFFER on 10.245.208.170 to 52:54:00:50:fd:5b via broam
<primary> Sep 20 07:53:21 infra1 dhcpd[24849]: DHCPOFFER on 10.245.208.222 to 52:54:00:50:fd:5b

So the question is, why is the secondary rack server offering an IP? Why is the "load balance to peer" thing happening?

Changed in maas:
status: Incomplete → New
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Sorry, in the paste above infra2 was the primary and infra1 was the secondary.

I disabled the the secondary rack controller for the vlan and it booted fine:
http://paste.ubuntu.com/25579371/

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

After reenabling the secondary rack controller for the vlan, it continues to work fine. Before that, I could reproduce it every time. It seems like the dhcpd servers are getting into some bad state that is corrected by disabling and reenabling the secondary rack controller's dhcp.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Oops, pasted the wrong log in #12. http://paste.ubuntu.com/25579457/

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

maas logs from the latest reproduce.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

dhcpd.conf from the secondary: http://paste.ubuntu.com/25580043/

and from the primary: http://paste.ubuntu.com/25580048/

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

another reproduce.

syslog from secondary: http://paste.ubuntu.com/25580683/
and primary: http://paste.ubuntu.com/25580691/

Revision history for this message
Ante Karamatić (ivoks) wrote :

Andres, I don't think this is related to VMs in any shape or form. STP and PortFast are a thing that happens with physical networks too and is not any factor here.

Logs in comment #11 indicate that machine did boot at 07:09. It then, most probably, ran dhcp client within the operating system at 07:10 and got a different IP. It should have gotten the same one (unless PXE pool is different than normal pool).

But later on, both dhcp servers refuse to renew .170 IP. It's like it was given to someone else in the meantime? Why does MAAS observe an IP it gave to a machine. Could the problem be there somehow?

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

after the reproduce i restarted maas-dhcpd on the primary controller and PXE booting worked again.

Here's the log from that:
http://paste.ubuntu.com/25580768/

The interesting lines there are:
Sep 20 16:36:40 infra2 maas.dhcp: [error] Could not remove host map for 52:54:00:1d:43:20: The DHCP server could not be reached.
Sep 20 16:36:40 infra2 maas.dhcp: [warn] Failed to update all host maps. Restarting DHCPv4 service to ensure host maps are in-sync.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

full syslog from the primary: http://paste.ubuntu.com/25580785/
and the secondary: http://paste.ubuntu.com/25580795/

Revision history for this message
Jason Hobbs (jason-hobbs) wrote : Re: pod VM fails to PXE boot after receiving multiple DHCP offers from both primary and secondary rack controllers, for different IPs

Another reproduction - in this case the node received offers from both primary and secondary at the same time, took the offer from the secondary, and then made a tftprequest for pxelinux.0 from the secondary, but then didn't progress further:

http://paste.ubuntu.com/25593132/

Why are both primary and secondary offering IP addresses?

summary: - pod VM fails to PXE boot after receiving multiple DHCP offers
+ pod VM fails to PXE boot after receiving multiple DHCP offers from both
+ primary and secondary rack controllers, for different IPs
Changed in maas:
importance: Undecided → High
status: New → Triaged
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Alright, learning a little about more about DHCP.

The reason both dhcp servers are responding is because this setting:

"load balance max seconds 3"

This means that if the SECS value in the dhcp request header is greater than 3, the secondary server will respond, regardless of load balancing settings.

Here's a tcpdump of a dhcp discover/offer sequence captured the interface the secondary server provides dhcp on:

http://paste.ubuntu.com/25595132/

There are two discover packets received - one with a value of 4 and the next with a value of 8. This is kind of weird because they are captured about one second apart.

It's also weird that we don't see any dhcp requests before that - tcpdump was started well in advance of the node booting.

It also doesn't explain why sometimes the dhcp servers reject the request with a dhcpnak, leading to failed boots, and sometimes they accept it.

A workaround for us may be to increase the value of "load balance max seconds" to something over what we're seeing here - like 10 or 15. It could have negative consequences if the primary dhcp server can't respond for some reason, but the secondary doesn't know it's down, and the client times out its request before getting to 10 or 15 seconds.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

A dhcp discover/response sequence from the same VM on the primary controller. The primary controller is providing dhcp on 'broam', which is also the bridge the VM that is PXE booting is on, so you see the dhcp responses from both servers there:

http://paste.ubuntu.com/25595239/

It also doesn't see any requests with secs < 4. It's interesting that both dhcp servers respond after the secs = 4 request, but the client seems to ignore them and does another request with secs = 10.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

ipxe and isc dhcpd have different meanings of the value of the secs field, and that's leading to the secondary dhcp server responding in a time frame where they normally wouldn't.

The reason the secs field grows faster than the actual number of elapsed seconds is that ipxe (where the pxe rom for qemu booting with seabios comes from) abuses (in their own words) the field to convey debug information, rather than the actual amount of time passed:

https://github.com/ipxe/ipxe/blob/74d90b33f8490adcee2026ece18d8411d93b6a39/src/net/udp/dhcp.c#L1129

so a value of "4" means the packet count is 1, not that 4 seconds have gone by.

It also uses shorter timeouts than the PXE standard calls for:
 * The PXE spec indicates discover request are sent 4 times, with
 * timeouts of 4, 8, 16, 32 seconds. iPXE by default uses 1, 2, 4, 8.

https://github.com/ipxe/ipxe/blob/74d90b33f8490adcee2026ece18d8411d93b6a39/src/config/dhcp.h#L22

So the value of "4" in the header means it's either the first or second dhcp discover request (I can't tell if the counting starts at 0 or 1), and 1 second later we see the second, which is marked 10, which is 2 << 2 | 0x2, meaning it's the second (or 3rd?) transmission and that it's seen an offer before.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

The reason the first round of offers is ignored is ipxe waits for proxy DHCP offers until at least 2 seconds have elapsed since its first request, unless the offer instructs it to ignore proxy DHCP offers, which our response doesn't.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Ah, we're using ipxe-qemu 1.0.0+git-20150424.a25a16d-1ubuntu1, which is from ipxe source in april of 2015. They've changed the way the secs field is computed since then - in current source it's '0' for the first request; in the version we're using it's '4' for the first request.

https://github.com/ipxe/ipxe/blob/a25a16d4adf663abd8f6eaab266b9444157357b6/src/net/udp/dhcp.c#L1080

So we are seeing all of the dhcp requests at the dhcp servers, and the first one will always have secs = 4, which will always trigger the secondary dhcp server to respond.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Setting this option:

option ipxe.no-pxedhcp 1;

will cause ipxe to respond to the first offer, speeding up the boot.

see http://ipxe.org/howto/dhcpd for more details.

Setting "load balance max seconds" to 5 instead of 3 will make the secondary dhcp server ignore the first request, so that only the primary will make an offer, which combined with the no-pxedhcp option above, ipxe will respond to.

Changed in maas:
milestone: 2.3.0 → 2.3.0beta2
Changed in maas:
importance: High → Critical
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

I modified the dhcpd.conf.template to include the settings I talked about above:

http://paste.ubuntu.com/25617177/

Those settings are getting picked up, but not fully having the desired affect.

The ipxe option doesn't seem to be getting included in the offer reply, and the secondary server is still offering even when secs < load balance max secs. I wonder if it offers always if it already has a lease written for the ip?

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Here's an annotated and collated set of logs from the primary and secondary dhcp server for a node booting. It shows some strange behavior, including the secondary dhcpd NAK'ing requests made to the primary dhcpd. The "max load balance secs" setting change did have some effect, but it doesn't seem to work all the time.

http://pastebin.ubuntu.com/25617320/

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

In the pastebin from comment #31, the 10.245.209.7 address comes from the host entry maas is creating in dhcpd.conf. The entry gets created on both the primary and secondary server. Perhaps if it's there, the server will respond, even if it's secondary?

The next question is why the primary dhcp server didn't offer 10.245.209.7 (the auto assigned IP) during the "second pxe boot" - when the node was booting to deploy. It instead offered 10.245.208.168 and added a lease entry for it:

http://paste.ubuntu.com/25617392/

I wonder if for some reason its dhcpd.conf hadn't been rewritten with the host entry for 52:54:00:af:49:b8?

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Here's where the IP allocation is supposed to happen, 10 seconds before the first DHCPDISCOVER request comes in. How quickly do dhcpd.conf's get updated? Is that guaranteed to happen before a node begins the PXE boot process? Maybe why we only see this with VMs is because VMs start PXE booting so much faster?

Sep 25 20:58:13 infra3 maas.interface: [info] Allocated automatic IP address 10.245.209.7 for ens4 (physical) on thruk-3.
Sep 25 20:58:14 infra3 maas.node: [info] thruk-3: Status transition from ALLOCATED to DEPLOYING

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

my current theory:

node deploy starts, machine is powered on
dhcpd.conf on secondary is written with host entry for node
node issues dhcpdiscovery
primary responds with dynamic ip and writes lease
secondary responds with ip from host entry
node requests the dynamic ip offered by primary
secondary (not sure why it's responding!) NAKs the request
dhcpd.conf on primary is written with host entry for node

but I don't know enough about how dhcpd.conf gets written to know if that's possible

Revision history for this message
Blake Rouse (blake-rouse) wrote :

Jason,

I think your theory is correct.

Context on how it works:
We do write the hostmaps into the config but dhcpd is not reloaded (isc-dhcpd doesn't support reload) we using the OMAPI to update the hostmaps so no to bounce the DHCP server. The hostmap is only written to the dhcpd.conf just incase the leases file is deleted and dhcpd is restarted.

The updating of this information is out of band, from the machine being turned on. Meaning that MAAS does not block until hostmap is written before powering the machine on.

We will will need to look into how we can verify that its written before actually powering the machine on. I also think you are correct in that normal machines do not have this problem because they take long enough to power up before it makes its first PXE request.

Really thanks for digging into this and figuring out what was really going wrong. The hard part now is fixing it.

Thanks!

Revision history for this message
Jason Hobbs (jason-hobbs) wrote : Re: [Bug 1707999] Re: pod VM fails to PXE boot after receiving multiple DHCP offers from both primary and secondary rack controllers, for different IPs

Great! Thanks Blake

On Tue, Sep 26, 2017 at 11:12 AM, Blake Rouse <email address hidden>
wrote:

> Jason,
>
> I think your theory is correct.
>
> Context on how it works:
> We do write the hostmaps into the config but dhcpd is not reloaded
> (isc-dhcpd doesn't support reload) we using the OMAPI to update the
> hostmaps so no to bounce the DHCP server. The hostmap is only written to
> the dhcpd.conf just incase the leases file is deleted and dhcpd is
> restarted.
>
> The updating of this information is out of band, from the machine being
> turned on. Meaning that MAAS does not block until hostmap is written
> before powering the machine on.
>
> We will will need to look into how we can verify that its written before
> actually powering the machine on. I also think you are correct in that
> normal machines do not have this problem because they take long enough
> to power up before it makes its first PXE request.
>
> Really thanks for digging into this and figuring out what was really
> going wrong. The hard part now is fixing it.
>
> Thanks!
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1707999
>
> Title:
> pod VM fails to PXE boot after receiving multiple DHCP offers from
> both primary and secondary rack controllers, for different IPs
>
> Status in MAAS:
> Triaged
>
> Bug description:
> A VM failed to PXE boot after receiving multiple DHCP offers.
>
> You can see this here on a log from the secondary controller:
> http://paste.ubuntu.com/25221939/
>
> The node is offered both 10.245.208.201 and 10.245.208.120, tries to
> get 10.245.208.120, and is refused.
>
> One strange thing is that it seems like the DHCP server on both the
> primary controller and the secondary controller are responding. The
> primary controller's log doesn't have the offer for 10.245.208.120 - only
> the offer for 10.245.208.201:
> http://paste.ubuntu.com/25221952/
>
> This is in an HA setup: region API's are at 10.245.208.30,
> 10.245.208.31 and 10.245.208.32. We're using hacluster to load
> balance, and a VIP in front at 10.245.208.33. There are rack
> controllers on 10.245.208.30 and 10.245.208.31. For the untagged vlan
> this VM is trying to boot from, 10.245.208.30 is set as the primary
> controller, and 10.245.208.31 is set as the secondary.
>
> Primary postgres is on 10.245.208.30, it's being replicated to backup
> postgres on 10.245.208.31. It has a VIP at 10.245.208.34.
>
> We don't hit this everytime - on this deployment only one machine out
> of about 30 hit this.
>
> I've attached logs from the maas servers.
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/maas/+bug/1707999/+subscriptions
>

Revision history for this message
Jason Hobbs (jason-hobbs) wrote : Re: pod VM fails to PXE boot after receiving multiple DHCP offers from both primary and secondary rack controllers, for different IPs

We saw this today on a single node MAAS setup - no HA:

http://paste.ubuntu.com/25634735/

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

It seems like fixing bug 1460816 could help with this issue too. It wouldn't address the root cause but would make MAAS recover on its own here instead of needing user intervention.

Chris Gregan (cgregan)
tags: added: cdo-release-blocker
summary: - pod VM fails to PXE boot after receiving multiple DHCP offers from both
- primary and secondary rack controllers, for different IPs
+ pod VM fails to PXE boot after receiving multiple DHCP offers, for
+ different IPs, from the dhcp server
description: updated
tags: added: internal
Changed in maas:
milestone: 2.3.0beta2 → 2.3.0beta3
Revision history for this message
Blake Rouse (blake-rouse) wrote : Re: pod VM fails to PXE boot after receiving multiple DHCP offers, for different IPs, from the dhcp server

The issue is that iPXE doesn't handle DHCPNACK at all. It doesn't change the DHCP state.

If needs to check for NACK, here: https://github.com/ipxe/ipxe/blob/master/src/net/udp/dhcp.c#L560 and set: `dhcp_set_state ( dhcp, &dhcp_state_discover );`.

It seems that fixing this in MAAS is not the correct thing to do, as the DHCP server is sending a NACK but the client is not handling it.

Changed in ipxe (Ubuntu):
importance: Undecided → Critical
Changed in maas:
status: Triaged → Incomplete
Changed in maas:
assignee: nobody → Blake Rouse (blake-rouse)
Changed in maas:
status: Incomplete → Invalid
Changed in ipxe (Ubuntu):
assignee: nobody → Andres Rodriguez (andreserl)
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Blake, nice work on the patch, it appears that it worked!

Here's a snippet of my syslog where you can see the machine requesting an IP, getting NAK'd, then going back to discover state and successfully completing dhcp with a new ip.

http://paste.ubuntu.com/25728345/

Revision history for this message
Andres Rodriguez (andreserl) wrote :
tags: added: patch
David Britton (dpb)
Changed in ipxe (Ubuntu):
assignee: Andres Rodriguez (andreserl) → ChristianEhrhardt (paelzer)
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

The patch is built in ppa:blake-rouse/ipxe

Revision history for this message
David Britton (dpb) wrote :

Christian -- Andres is planning on uploading this, but could use a review first to make sure it is sane from a qemu/kvm perspective.

Revision history for this message
Andres Rodriguez (andreserl) wrote :

As David said, and to clarify, I was planning on uploading this patch to the archive, but I wanted to double check with you guys to ensure that the implementation is sane from your perspective, as this will affect ipxe clients that receive DHCPNACKs.

Also, it would be good to get a different perspective from the RFC standpoint.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Hi Andres, Blake and Jason,
reading this long context trying to help avaluating the change.

Summarizing (mostly for myself) c#34 and c#35 means that you think it is "ok" due to the out-of-band update to have time windows where primary and secondary will offer different IPs.
Instead of (trying to) fix this part of the dhcp setup to onyl ever offer one (or at least offer the same from primary and secondary) the suggestion is to make iPXE aware of how to handle the NACK.
The patch in c#41 and tests in c#40 suggest this works fine for your case now.
Overall that was already a lot of work to debug it down, kudos to all of you.

Taking a look on this from the qemu POV - probably especially since the last ipxe upload broke all migrations :-) (due to a size change).

One thing I wondered, but think we are good are state loops.
Since we add a state transition back (that didn't exist before at all) it might need some sort of infinite loop protection. I tried to read all that code but might be wrong.
D: dhcp_state_discover; R: dhcp_state_request
init: -> D -> R
         ^---/ (this is what we add)
As I read it, this will continue as before, neiher _expiry function had a responsibiity to abort he overall discovery. Each state change resets it's timer, but that should be ok. Also a scenario where a guest repeatetly gets an offer but then Nacked should be invalid in general. So it should (IMHO) be ok to add this transition.

In addition to the review I ran some tests (on Xenial as the ppa). On Artful sizes would now be checked on build, but the ppa is only xenial - so I checked manually and they are ok. Further some upgrade & migration test I ran were all without issues - ok.

The open comments/requests to you that I'd have are:
1. Trivial: current patch has (only) whitespace damage in hunk #2
2. Upstreaming: While we can't wait (or could we) with the fix for an upstream accept it should be forwarded
3. Formalisms in the patch header:
  3.1 After being submitted please add as usual add to the dep3 header a Forwarded statement
      with a link to where you sent it.
  3.2 We all know how hard backtracing in a few years can be, please add like
      Bug-Ubuntu: https://bugs.launchpad.net/bugs/1707999

Other than that I like it and would consider it clearly good enough to go from on-bug-RFC to ipxe-upstream-RFC with the patch.

Revision history for this message
Blake Rouse (blake-rouse) wrote :

I have submitted the patch upstream: https://github.com/ipxe/ipxe/pull/66

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package ipxe - 1.0.0+git-20161027.b991c67+really20150424.a25a16d-1ubuntu2

---------------
ipxe (1.0.0+git-20161027.b991c67+really20150424.a25a16d-1ubuntu2) artful; urgency=medium

  * debian/patches/handle-dhcp-nack.patch: Handle DHCP NAK and send a
    re-discover. (LP: #1707999)

 -- Andres Rodriguez <email address hidden> Thu, 12 Oct 2017 16:25:52 -0400

Changed in ipxe (Ubuntu):
status: New → Fix Released
Changed in ipxe (Ubuntu Bionic):
status: New → Fix Released
Revision history for this message
Christian Reis (kiko) wrote :

Can someone summarize what is blocking the Xenial incarnation of this bug in moving ahead?

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Nothing from my POV, I expected Andres to propose or upload the same fix for X/Z now as he did for Artful last month.

Btw I'll mark Artful correctly and add X/Z tasks.

Changed in ipxe (Ubuntu Artful):
status: New → Fix Released
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Since the code is essentially the same it should likely apply as is.
If one expected that I take over and do the uploads please tell me so.

Changed in ipxe (Ubuntu Zesty):
status: New → Triaged
Changed in ipxe (Ubuntu Xenial):
status: New → Triaged
summary: - pod VM fails to PXE boot after receiving multiple DHCP offers, for
- different IPs, from the dhcp server
+ [SRU] iPXE doesn't handle NAK requests when multiple DHCP server's offer
description: updated
Changed in ipxe (Ubuntu Zesty):
importance: Undecided → Critical
Changed in ipxe (Ubuntu Xenial):
importance: Undecided → Critical
Revision history for this message
Brian Murray (brian-murray) wrote : Please test proposed package

Hello Jason, or anyone else affected,

Accepted ipxe into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/ipxe/1.0.0+git-20150424.a25a16d-1ubuntu1.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed.Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-xenial to verification-done-xenial. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-xenial. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in ipxe (Ubuntu Xenial):
status: Triaged → Fix Committed
tags: added: verification-needed verification-needed-xenial
Changed in ipxe (Ubuntu Zesty):
status: Triaged → Fix Committed
tags: added: verification-needed-zesty
Revision history for this message
Brian Murray (brian-murray) wrote :

Hello Jason, or anyone else affected,

Accepted ipxe into zesty-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/ipxe/1.0.0+git-20150424.a25a16d-1ubuntu2.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed.Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-zesty to verification-done-zesty. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-zesty. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

I tested ipxe-qemu 1.0.0+git-20150424.a25a16d-1ubuntu1.2 overnight on a xenial system by repeatedly deploying and releasing about 15 VMs.

I reproduced the original failure case and the fixed package handled it appropriately.

Example results: http://paste.ubuntu.com/26028916/

tags: added: verification-done-xenial
removed: verification-needed-xenial
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

The package build for zesty failed. It needs to be rebuilt. However, since it's the exact same package as for xenial, I'm not sure it needs separate verification.

I'm going to mark it failed for now though and get more input from more SRU experienced folks.

tags: added: verification-failed-zesty
removed: verification-needed-zesty
Revision history for this message
Brian Murray (brian-murray) wrote :

It will need separate verification for zesty once the build failure is fixed.

Revision history for this message
Andres Rodriguez (andreserl) wrote :

This is the part of the buildlog... looks like may be unrelated to the change.

arch/x86/drivers/xen/hvm.c arch/x86/drivers/hyperv/hyperv.c arch/i386/drivers/net/undirom.c arch/i386/drivers/net/undinet.c arch/i386/drivers/net/undionly.c arch/i386/drivers/net/undi.c arch/i386/drivers/net/undiload.c arch/i386/drivers/net/undipreload.c arch/i386/drivers/net/undiisr.S > bin/.rom.defs
Can't redeclare "my" in "my" at ./util/parserom.pl line 160, near ", "
Execution of ./util/parserom.pl aborted due to compilation errors.
make[2]: *** Deleting file 'bin/.rom.defs'
gcc -DARCH=i386 -DPLATFORM=pcbios -march=i386 -fomit-frame-pointer -fstrength-reduce -falign-jumps=1 -falign-loops=1 -falign-functions=1 -mpreferred-stack-boundary=2 -mregparm=3 -mrtd -freg-struct-return -m32 -fshort-wchar -Ui386 -Ulinux -DNVALGRIND -Iinclude -I. -Iarch/x86/include -Iarch/i386/include -Iarch/i386/include/pcbios -Os -g -ffreestanding -Wall -W -Wformat-nonliteral -fno-stack-protector -fno-dwarf2-cfi-asm -fno-exceptions -fno-unwind-tables -fno-asynchronous-unwind-tables -Wno-address -fno-PIE -no-pie -ffunction-sections -fdata-sections -include include/compiler.h -DOBJECT=version -DBUILD_NAME="\"82540em.rom\"" \
 -DVERSION_MAJOR=1 \
 -DVERSION_MINOR=0 \
 -DVERSION_PATCH=0 \
 -DVERSION="\"1.0.0+git-20150424.a25a16d-1ubuntu2.1\"" \
 -c core/version.c -o bin/version.82540em.rom.o
In file included from ./config/general.h:172:0,
                 from core/version.c:35:
./config/local/general.h:1:0: warning: "ROM_BANNER_TIMEOUT" redefined
 #define ROM_BANNER_TIMEOUT 0

In file included from core/version.c:35:0:
./config/general.h:30:0: note: this is the location of the previous definition
 #define ROM_BANNER_TIMEOUT ( 2 * BANNER_TIMEOUT )

ld -m elf_i386 -N --no-check-sections --gc-sections -static -T arch/i386/scripts/i386.lds -u _rom_start --defsym check__rom_start=_rom_start -u obj_82540em --defsym check_obj_82540em=obj_82540em -u obj_config --defsym check_obj_config=obj_config --defsym pci_vendor_id=0 --defsym pci_device_id=0 -e _rom_start bin/version.82540em.rom.o bin/blib.a -o bin/82540em.rom.tmp \
 --defsym _build_id=`perl -e 'printf "0x%08x", int ( rand ( 0xffffffff ) );'` \
 --defsym _build_timestamp=1511794740 \
 -Map bin/82540em.rom.tmp.map
--defsym:2: undefined symbol `obj_82540em' referenced in expression
Makefile.housekeeping:1100: recipe for target 'bin/82540em.rom.tmp' failed
make[2]: *** [bin/82540em.rom.tmp] Error 1
rm bin/version.82540em.rom.o
make[2]: Leaving directory '/<<BUILDDIR>>/ipxe-1.0.0+git-20150424.a25a16d/src'
debian/rules:18: recipe for target 'src/bin/82540em.rom' failed
make[1]: *** [src/bin/82540em.rom] Error 2
make[1]: Leaving directory '/<<BUILDDIR>>/ipxe-1.0.0+git-20150424.a25a16d'
debian/rules:12: recipe for target 'build' failed
make: *** [build] Error 2
dpkg-buildpackage: error: debian/rules build gave error exit status 2
--------------------------------------------------------------------------------

Revision history for this message
Andres Rodriguez (andreserl) wrote :

I can confirm that ipxe is FTBFS in Zesty (without the patch).

Revision history for this message
Steve Langasek (vorlon) wrote :

Do you intend to resolve this build failure for zesty? Should the zesty SRU be withdrawn due to the short remaining support window (2 months)?

Under the circumstances I don't think fixing this in zesty should be a blocker for releasing it to xenial.

Google search for the failure message points to the related Debian bug, which includes a build system fix for the package. https://bugs.debian.org/839328

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

From our perspective we don't intend to use zesty ever, so the zesty SRU is not required. I would love for this to just go into xenial ASAP.

Revision history for this message
Andres Rodriguez (andreserl) wrote :

I've uploaded a new version of the package fixing the FTBFS using the patch from the Debian bug referenced above from Steve.

Debian fixed this issue by getting a new upstream checkout of iPXE.

Revision history for this message
Andres Rodriguez (andreserl) wrote :

Seems someone needs to reject/remove the previous upload (from proposed) so the new upload doesn't get rejected.

Revision history for this message
Steve Langasek (vorlon) wrote :

No, you need to upload with a new version number. You can't reuse a version number in launchpad.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote : Re: [Bug 1707999] Re: [SRU] iPXE doesn't handle NAK requests when multiple DHCP server's offer

Then we would need a new version for xenial too right, and have to
re-verify it? Can we just skip zesty?

On Wed, Nov 29, 2017 at 5:45 PM, Steve Langasek <
<email address hidden>> wrote:

> No, you need to upload with a new version number. You can't reuse a
> version number in launchpad.
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1707999
>
> Title:
> [SRU] iPXE doesn't handle NAK requests when multiple DHCP server's
> offer
>
> Status in MAAS:
> Invalid
> Status in MAAS 2.2 series:
> Invalid
> Status in ipxe package in Ubuntu:
> Fix Released
> Status in ipxe source package in Xenial:
> Fix Committed
> Status in ipxe source package in Zesty:
> Fix Committed
> Status in ipxe source package in Artful:
> Fix Released
> Status in ipxe source package in Bionic:
> Fix Released
>
> Bug description:
> [Impact]
> When there are multiple DHCP servers on the network, iPXE doesn't handle
> NAK's for the DHCP servers. This causes iPXE to get blocked without
> attempting to re-discover, hence, never obtaining an IP address.
>
> For example, in a MAAS HA environment with a DHCP master/slave
> configuration, the machine fails to PXE boot because at a certain
> point, the DHCP server is not fully in sync, which causes iPXE to get
> a NAK request. This prevents the machine from PXE booting.
>
> [Test case]
> The easiest way:
> 1. Install MAAS with two rack controllers
> 2. Configure HA
> 3. PXE boot KVM's.
>
> [Regression Potential]
> Minimal. This only ensures that iPXE attempts to re-discover the network
> when it receives a NACK.
>
> [Original bug report]
> A VM failed to PXE boot after receiving multiple DHCP offers.
>
> You can see this here on a log from the secondary controller:
> http://paste.ubuntu.com/25221939/
>
> The node is offered both 10.245.208.201 and 10.245.208.120, tries to
> get 10.245.208.120, and is refused.
>
> One strange thing is that it seems like the DHCP server on both the
> primary controller and the secondary controller are responding. The
> primary controller's log doesn't have the offer for 10.245.208.120 - only
> the offer for 10.245.208.201:
> http://paste.ubuntu.com/25221952/
>
> This is in an HA setup: region API's are at 10.245.208.30,
> 10.245.208.31 and 10.245.208.32. We're using hacluster to load
> balance, and a VIP in front at 10.245.208.33. There are rack
> controllers on 10.245.208.30 and 10.245.208.31. For the untagged vlan
> this VM is trying to boot from, 10.245.208.30 is set as the primary
> controller, and 10.245.208.31 is set as the secondary.
>
> Primary postgres is on 10.245.208.30, it's being replicated to backup
> postgres on 10.245.208.31. It has a VIP at 10.245.208.34.
>
> We don't hit this everytime - on this deployment only one machine out
> of about 30 hit this.
>
> We've also seen this on single node MAAS setups - non HA. So, it's
> not an HA specific issue.
>
> I've attached logs from the maas servers.
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/maas/+bug/1707999/+subscriptions
>

Revision history for this message
Andres Rodriguez (andreserl) wrote :
Download full text (5.5 KiB)

no need to re-verify xenial, this is only affected in zesty./

On Wed, Nov 29, 2017 at 12:10 PM, Jason Hobbs <email address hidden>
wrote:

> Then we would need a new version for xenial too right, and have to
> re-verify it? Can we just skip zesty?
>
> On Wed, Nov 29, 2017 at 5:45 PM, Steve Langasek <
> <email address hidden>> wrote:
>
> > No, you need to upload with a new version number. You can't reuse a
> > version number in launchpad.
> >
> > --
> > You received this bug notification because you are subscribed to the bug
> > report.
> > https://bugs.launchpad.net/bugs/1707999
> >
> > Title:
> > [SRU] iPXE doesn't handle NAK requests when multiple DHCP server's
> > offer
> >
> > Status in MAAS:
> > Invalid
> > Status in MAAS 2.2 series:
> > Invalid
> > Status in ipxe package in Ubuntu:
> > Fix Released
> > Status in ipxe source package in Xenial:
> > Fix Committed
> > Status in ipxe source package in Zesty:
> > Fix Committed
> > Status in ipxe source package in Artful:
> > Fix Released
> > Status in ipxe source package in Bionic:
> > Fix Released
> >
> > Bug description:
> > [Impact]
> > When there are multiple DHCP servers on the network, iPXE doesn't
> handle
> > NAK's for the DHCP servers. This causes iPXE to get blocked without
> > attempting to re-discover, hence, never obtaining an IP address.
> >
> > For example, in a MAAS HA environment with a DHCP master/slave
> > configuration, the machine fails to PXE boot because at a certain
> > point, the DHCP server is not fully in sync, which causes iPXE to get
> > a NAK request. This prevents the machine from PXE booting.
> >
> > [Test case]
> > The easiest way:
> > 1. Install MAAS with two rack controllers
> > 2. Configure HA
> > 3. PXE boot KVM's.
> >
> > [Regression Potential]
> > Minimal. This only ensures that iPXE attempts to re-discover the
> network
> > when it receives a NACK.
> >
> > [Original bug report]
> > A VM failed to PXE boot after receiving multiple DHCP offers.
> >
> > You can see this here on a log from the secondary controller:
> > http://paste.ubuntu.com/25221939/
> >
> > The node is offered both 10.245.208.201 and 10.245.208.120, tries to
> > get 10.245.208.120, and is refused.
> >
> > One strange thing is that it seems like the DHCP server on both the
> > primary controller and the secondary controller are responding. The
> > primary controller's log doesn't have the offer for 10.245.208.120 - only
> > the offer for 10.245.208.201:
> > http://paste.ubuntu.com/25221952/
> >
> > This is in an HA setup: region API's are at 10.245.208.30,
> > 10.245.208.31 and 10.245.208.32. We're using hacluster to load
> > balance, and a VIP in front at 10.245.208.33. There are rack
> > controllers on 10.245.208.30 and 10.245.208.31. For the untagged vlan
> > this VM is trying to boot from, 10.245.208.30 is set as the primary
> > controller, and 10.245.208.31 is set as the secondary.
> >
> > Primary postgres is on 10.245.208.30, it's being replicated to backup
> > postgres on 10.245.208.31. It has a VIP at 10.245.208.34.
> >
> > We don't hit this everytime - on this deployment only one machine o...

Read more...

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package ipxe - 1.0.0+git-20150424.a25a16d-1ubuntu1.2

---------------
ipxe (1.0.0+git-20150424.a25a16d-1ubuntu1.2) xenial-proposed; urgency=medium

  * debian/patches/handle-dhcp-nack.patch: Handle DHCP NAK and send a
    re-discover. (LP: #1707999)

 -- Andres Rodriguez <email address hidden> Mon, 20 Nov 2017 13:13:41 -0500

Changed in ipxe (Ubuntu Xenial):
status: Fix Committed → Fix Released
Revision history for this message
Brian Murray (brian-murray) wrote : Update Released

The verification of the Stable Release Update for ipxe has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

tags: added: verification-needed-zesty
removed: verification-failed-zesty
Revision history for this message
Brian Murray (brian-murray) wrote : Please test proposed package

Hello Jason, or anyone else affected,

Accepted ipxe into zesty-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/ipxe/1.0.0+git-20150424.a25a16d-1ubuntu2.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed.Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-zesty to verification-done-zesty. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-zesty. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Due to recent issues on ipxe changes I ran a set of regression tests that I usually run for qemu/libvirt against the ppa - found no regression.
That is no test of the actual case, therefore not updating tags yet.

Revision history for this message
Christian Reis (kiko) wrote :

Zesty went EOL before this got published I think, so marking Invalid.

Changed in ipxe (Ubuntu Zesty):
status: Fix Committed → Invalid
Revision history for this message
Michael Brown (mcb30) wrote :
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Thank you Michael, I can drop our fix when merging the next version then.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.