Comment 36 for bug 1707999

Jason Hobbs (jason-hobbs) wrote : Re: [Bug 1707999] Re: pod VM fails to PXE boot after receiving multiple DHCP offers from both primary and secondary rack controllers, for different IPs

Great! Thanks, Blake.

On Tue, Sep 26, 2017 at 11:12 AM, Blake Rouse <email address hidden>
wrote:

> Jason,
>
> I think your theory is correct.
>
> Context on how it works:
> We do write the hostmaps into the config, but dhcpd is not reloaded
> (isc-dhcpd doesn't support reload). We use OMAPI to update the
> hostmaps so as not to bounce the DHCP server. The hostmap is only written to
> dhcpd.conf just in case the leases file is deleted and dhcpd is
> restarted.
>
> The hostmap update happens out of band from the machine being
> turned on, meaning that MAAS does not block until the hostmap is written
> before powering the machine on.
>
> We will need to look into how we can verify that it's written before
> actually powering the machine on. I also think you are correct in that
> normal machines do not have this problem because they take long enough
> to power up before making their first PXE request.
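
A minimal sketch of the ordering discussed above: hold the power-on until the hostmap write has been confirmed, and give up after a timeout. This is not MAAS code; `HostmapTracker`, `mark_written`, `wait_written`, and `power_on_when_ready` are hypothetical names, and the assumption is that whatever performs the OMAPI update can signal completion for a given MAC.

```python
import threading


class HostmapTracker:
    """Hypothetical helper: remembers which MACs have a confirmed hostmap."""

    def __init__(self):
        self._cond = threading.Condition()
        self._confirmed = set()

    def mark_written(self, mac):
        # Assumed to be called by the code that performs the OMAPI update
        # once the DHCP server has acknowledged the hostmap.
        with self._cond:
            self._confirmed.add(mac)
            self._cond.notify_all()

    def wait_written(self, mac, timeout):
        # Block until the hostmap for `mac` is confirmed or `timeout`
        # seconds pass. Returns True if confirmed, False on timeout.
        with self._cond:
            return self._cond.wait_for(
                lambda: mac in self._confirmed, timeout=timeout
            )


def power_on_when_ready(tracker, mac, power_on, timeout=30.0):
    """Call `power_on` only after the hostmap for `mac` is confirmed."""
    if not tracker.wait_written(mac, timeout):
        raise TimeoutError("hostmap for %s not confirmed" % mac)
    power_on()
```

With this shape, a fast-booting pod VM cannot send its first PXE request before the hostmap exists, at the cost of delaying power-on by however long the update takes.
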
>
> Thanks a lot for digging into this and figuring out what was really
> going wrong. The hard part now is fixing it.
>
> Thanks!
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1707999
>
> Title:
> pod VM fails to PXE boot after receiving multiple DHCP offers from
> both primary and secondary rack controllers, for different IPs
>
> Status in MAAS:
> Triaged
>
> Bug description:
> A VM failed to PXE boot after receiving multiple DHCP offers.
>
> You can see this here on a log from the secondary controller:
> http://paste.ubuntu.com/25221939/
>
> The node is offered both 10.245.208.201 and 10.245.208.120, tries to
> get 10.245.208.120, and is refused.
>
> One strange thing is that the DHCP servers on both the primary
> controller and the secondary controller seem to be responding. The
> primary controller's log doesn't have the offer for 10.245.208.120, only
> the offer for 10.245.208.201:
> http://paste.ubuntu.com/25221952/
>
> This is in an HA setup: region APIs are at 10.245.208.30,
> 10.245.208.31 and 10.245.208.32. We're using hacluster to load
> balance, and a VIP in front at 10.245.208.33. There are rack
> controllers on 10.245.208.30 and 10.245.208.31. For the untagged vlan
> this VM is trying to boot from, 10.245.208.30 is set as the primary
> controller, and 10.245.208.31 is set as the secondary.
>
> Primary postgres is on 10.245.208.30, and it's being replicated to
> backup postgres on 10.245.208.31. It has a VIP at 10.245.208.34.
>
> We don't hit this every time: on this deployment, only one machine out
> of about 30 hit this.
>
> I've attached logs from the maas servers.
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/maas/+bug/1707999/+subscriptions
>