Comment 7 for bug 1747764

Jason Hobbs (jason-hobbs) wrote : Re: [Bug 1747764] Re: rack controller HA fails during a network partition

On Tue, Feb 6, 2018 at 3:47 PM, Andres Rodriguez
<email address hidden> wrote:
> Hey Jason,
>
> The information you provided is not enough for us to determine the
> configuration that you have. Since you now have a set environment, it
> would be ideal to have a diagram or something that shows us how you are
> configuring your MAAS HA environment, as it is difficult to understand
> without having a picture of how things are physically connected.
>
> That said, I have a few questions. As I understand it, you have 1 rack
> controller isolated from the *VIP* of the region controller. I guess
> this means that rackd.conf points to the VIP. But:
>
> 1. Can the rack controller connect to the region controllers directly?

The isolated rack controller can't talk to either of the other two
nodes at all. They don't see any traffic from it on their own IPs or
on the VIPs.

> 2. What is the state of the rack controller once it cannot connect to the VIP?

rackd is still running and spewing lots of errors. dhcpd, ntpd, tgtd,
etc. are all still running as well.

> 3. Is this rack controller a secondary for DHCP HA?

Yes.
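
To make the suggestion in the bug description below concrete, here is a
minimal sketch of the kind of check I have in mind: a rack controller
only answers DHCP while it can reach at least one region endpoint. This
is not how rackd works today. The function names are hypothetical, and
probing with a plain TCP connect is an assumption made for illustration.

```python
import socket
from typing import Callable, Iterable, Tuple

# (host, port) pair for a region controller or its VIP.
Endpoint = Tuple[str, int]


def tcp_reachable(endpoint: Endpoint, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds in time."""
    try:
        with socket.create_connection(endpoint, timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False


def should_serve_dhcp(
    region_endpoints: Iterable[Endpoint],
    probe: Callable[[Endpoint], bool] = tcp_reachable,
) -> bool:
    """Hypothetical policy: serve DHCP only if some region endpoint answers.

    An empty endpoint list counts as unreachable, so the rack controller
    would stand down and leave DHCP to a peer that can still reach the region.
    """
    return any(probe(endpoint) for endpoint in region_endpoints)
```

With placeholder addresses, a call would look like
should_serve_dhcp([("192.0.2.10", 5240), ("192.0.2.11", 5240)]). The
addresses and the port are assumptions, not values from my environment.
The probe is passed in so the policy can be tested without a network.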

> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1747764
>
> Title:
> rack controller HA fails during a network partition
>
> Status in MAAS:
> Incomplete
>
> Bug description:
> I have an HA setup with 3 MAAS controllers, each running rack
> controllers and region controllers.
>
> On two of the three controllers, I used iptables to drop traffic from
> the third, to simulate a network partition.
>
> Then I instructed MAAS to deploy a node. The node powered on fine,
> but when it started PXE booting, the third, isolated rack controller
> responded to the DHCP request, gave it an IP, and told it to fetch its
> pxelinux.cfg from that same controller via TFTP.
>
> That rack controller was unable to provide the pxelinux.cfg because it
> couldn't reach the region controller via the VIP due to the network
> partition, and the node failed to PXE boot.
>
> I think that the isolated rack controller should not be running DHCP.
> If a rack controller can't reach the region controller, it can't
> handle PXE booting a node, and shouldn't try. If it had not
> responded, one of the functional rack controllers would have, and the
> boot would have succeeded.
>
> In the attached logs, 10.245.31.4 is the node that was isolated. I
> started the isolation at about 21:15.
>
> This is with 2.3.0-6434-gd354690-0ubuntu1~16.04.1.