MAAS

Bug #1707999
Comment #36

Jason Hobbs (jason-hobbs) wrote on 2017-09-26: Re: [Bug 1707999] Re: pod VM fails to PXE boot after receiving multiple DHCP offers from both primary and secondary rack controllers, for different IPs

Great! Thanks Blake

On Tue, Sep 26, 2017 at 11:12 AM, Blake Rouse <email address hidden>
wrote:

> Jason,
>
> I think your theory is correct.
>
> Context on how it works:
> We do write the hostmaps into the config, but dhcpd is not reloaded
> (isc-dhcpd doesn't support reload). We use OMAPI to update the
> hostmaps so as not to bounce the DHCP server. The hostmap is only
> written to dhcpd.conf in case the leases file is deleted and dhcpd is
> restarted.
>
> This update happens out of band from the machine being turned on,
> meaning that MAAS does not wait for the hostmap to be written
> before powering the machine on.
>
> We will need to look into how we can verify that it's written before
> actually powering the machine on. I also think you are correct that
> normal machines do not have this problem because they take long enough
> to power up before making their first PXE request.
>
> Thanks a lot for digging into this and figuring out what was really
> going wrong. The hard part now is fixing it.
>
> Thanks!
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1707999
>
> Title:
> pod VM fails to PXE boot after receiving multiple DHCP offers from
> both primary and secondary rack controllers, for different IPs
>
> Status in MAAS:
> Triaged
>
> Bug description:
> A VM failed to PXE boot after receiving multiple DHCP offers.
>
> You can see this here on a log from the secondary controller:
> http://paste.ubuntu.com/25221939/
>
> The node is offered both 10.245.208.201 and 10.245.208.120, tries to
> get 10.245.208.120, and is refused.
>
> One strange thing is that it seems like the DHCP servers on both the
> primary controller and the secondary controller are responding. The
> primary controller's log doesn't have the offer for 10.245.208.120 - only
> the offer for 10.245.208.201:
> http://paste.ubuntu.com/25221952/
>
> This is in an HA setup: region APIs are at 10.245.208.30,
> 10.245.208.31 and 10.245.208.32. We're using hacluster to load
> balance, and a VIP in front at 10.245.208.33. There are rack
> controllers on 10.245.208.30 and 10.245.208.31. For the untagged vlan
> this VM is trying to boot from, 10.245.208.30 is set as the primary
> controller, and 10.245.208.31 is set as the secondary.
>
> Primary postgres is on 10.245.208.30; it's being replicated to backup
> postgres on 10.245.208.31. It has a VIP at 10.245.208.34.
>
> We don't hit this every time - on this deployment only one machine out
> of about 30 hit this.
>
> I've attached logs from the maas servers.
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/maas/+bug/1707999/+subscriptions
>
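For illustration only, the ordering Blake describes (confirm the hostmap is written, then power on) could be sketched roughly as below. This is not MAAS code: `hostmap_registered` and `power_on` are hypothetical callables standing in for whatever check (for example, an OMAPI lookup) and power driver would actually be used, and the timeout values are arbitrary assumptions.

```python
import time
from typing import Callable


def power_on_after_hostmap(
    mac: str,
    hostmap_registered: Callable[[str], bool],  # hypothetical: True once dhcpd knows the host
    power_on: Callable[[], None],               # hypothetical: powers on the machine
    timeout: float = 30.0,
    interval: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Power on only after the hostmap for `mac` is confirmed.

    Returns True if the machine was powered on, False if the hostmap
    was not confirmed before the timeout (power_on is then not called).
    """
    deadline = clock() + timeout
    while True:
        if hostmap_registered(mac):
            # The DHCP server already has the static mapping, so the
            # first PXE request should be answered with the right IP.
            power_on()
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

The point of the sketch is only the blocking order: a fast-booting pod VM never gets to send its first DHCP request before the mapping is in place. How the check is really done, and what to do on timeout, are open questions for the actual fix.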
