[2.3, service-tracking] Network partition breaks HA rack controller which doesn't stop services

Bug #1747764 reported by Jason Hobbs on 2018-02-06
Affects: MAAS | Importance: High | Assigned to: Blake Rouse
Affects: 2.3 | Importance: High | Assigned to: Unassigned

Bug Description

I have an HA setup with 3 MAAS controllers, each running rack controllers and region controllers.

On two of the three controllers, I used iptables to drop traffic from the third, to simulate a network partition.

Then I instructed MAAS to deploy a node. The node powered on fine, but when it started PXE booting, the third isolated rack controller responded to the DHCP request, gave it an IP, and told it to talk to it via tftp to get its pxelinux.cfg.

That rack controller was unable to provide the pxelinux.cfg because it couldn't reach the region controller via the VIP due to the network partition, and the node failed to PXE boot.

I think that the isolated rack controller should not be running DHCP. If a rack controller can't reach the region controller, it can't handle PXE booting a node, and shouldn't try. If it would not have responded, one of the functional rack controllers would have and it would be fine.

In the attached logs, 10.245.31.4 is the node that was isolated. I started the isolation at about 21:15.

This is with 2.3.0-6434-gd354690-0ubuntu1~16.04.1.


Jason Hobbs (jason-hobbs) wrote :
Changed in maas:
status: New → Incomplete
Andres Rodriguez (andreserl) wrote :

Hey Jason,

The information you provided is not enough for us to determine the configuration that you have. Since you now have a set environment, it would be ideal to have a diagram or something similar that shows us how you are configuring your MAAS HA environment, as it is difficult to understand without a picture of how things are physically connected.

That said, I have a few questions. As I understand it, you have 1 rack controller isolated from the *VIP* of the region controller. I guess this means that rackd.conf points to the VIP. But:

1. Can the rack controller connect to the region controllers directly?
2. What is the state of the rack controller once it cannot connect to the VIP?
3. Is this rack controller a secondary for DHCP HA?

Andres Rodriguez (andreserl) wrote :

4. How long was it from the time the network partition was confirmed to the time the machine attempted to boot?
5. After the machine failed to boot, did the rack controller continue providing DHCP?
6. Did the rack controller fully disconnect from *all* regions?

Andres Rodriguez (andreserl) wrote :

I see in the logs this:

2018-02-06 21:14:26 provisioningserver.rpc.clusterservice: [info] Region not available: User timeout caused connection failure. (While requesting RPC info at b'http://[::ffff:10.245.32.102]/MAAS/rpc/').
2018-02-06 21:14:57 provisioningserver.rpc.clusterservice: [info] Region not available: User timeout caused connection failure. (While requesting RPC info at b'http://[::ffff:10.245.32.102]/MAAS/rpc/').
2018-02-06 21:15:25 twisted.internet.defer: [critical] Unhandled error in Deferred:
2018-02-06 21:15:25 twisted.internet.defer: [critical]

Based on your comment, that should be about the time you started the partition. So, a few questions about what happened when the rack disconnected:

 - It seems that dhcpd continued running. Did this change at all afterwards, or did it just never stop?
 - /var/lib/maas/dhcpd.conf seems not to have been deleted or updated to reflect the lost connection. Did it remain the same over time? Did it ever get removed?
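
To line up the partition time with what rackd saw, the log excerpt above can be scanned for the first "Region not available" entry. This is only a minimal sketch: it assumes the line layout shown in the excerpt (date, time, logger name, bracketed level, message), and `first_region_unavailable` is a hypothetical helper name, not part of MAAS.

```python
import re
from datetime import datetime

# Matches rackd.log lines of the form shown in the excerpt:
# "<date> <time> <logger>: [<level>] <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<logger>[\w.]+): \[(?P<level>\w+)\] ?(?P<msg>.*)$"
)


def first_region_unavailable(lines):
    """Return the timestamp of the first 'Region not available' entry, or None."""
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group("msg").startswith("Region not available"):
            return datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S")
    return None


# Usage, assuming the log lives at a path you supply:
# with open(path_to_rackd_log) as fh:
#     print(first_region_unavailable(fh))
```

Lines that do not match the layout (tracebacks, blank lines) are simply skipped.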

summary: - rack controller HA fails during a network partition
+ [2.3, ha] rack controller HA fails during a network partition

Blake Rouse (blake-rouse) wrote :

It is designed to stop DHCPD if the rack controller cannot talk to any region controllers. You prevented the rack controller from talking to the region over HTTP, but did you also prevent the RPC connections? That is a different port.
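
The HTTP-versus-RPC question can be checked from the isolated rack controller with a plain TCP probe of each port. A minimal sketch, assuming Python 3 is available on the controller; `can_connect` and `check_endpoints` are hypothetical helper names, not MAAS APIs, and the RPC port has to be supplied by the reader because this thread does not state it.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False


def check_endpoints(endpoints, timeout=3.0):
    """Map each (host, port) pair to whether it is reachable."""
    return {ep: can_connect(ep[0], ep[1], timeout) for ep in endpoints}


# Example: the address is the one in the rackd.log excerpt, port 80 is the
# default for its http:// URL, and rpc_port is a placeholder to fill in.
# check_endpoints([("10.245.32.102", 80), ("10.245.32.102", rpc_port)])
```

If both probes fail, the rack controller has neither HTTP nor RPC connectivity, which is the case where dhcpd is expected to stop.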

Jason Hobbs (jason-hobbs) wrote :

I've attached a drawing of my setup; I hope that helps. 10.245.31.4
(everitt) is the isolated controller in this test.

On Tue, Feb 6, 2018 at 3:53 PM, Andres Rodriguez
<email address hidden> wrote:
> 4. How long was it from the time the network partition was confirmed to the time the machine attempted to boot?

The first time through, what's in the logs, I started the deploy about
a minute after the rack controller lost its connection to the region
controllers.

It continued to try to boot and fail for about 20 minutes, at which
point I released the node. 45 minutes later now, I tried to deploy
the node again and I'm getting the same behavior - the isolated rack
controller is still providing dhcp.

> 5. After the machine failed to boot, did the rack controller continue providing DHCP?

Yes.

> 6. Did the rack controller fully disconnect from *all* regions?

There is a region controller running on the same node, which it may
still be able to talk to, but I don't think it's connected to it,
because it can't reach the region's VIP to get /rpc and find out where
it should connect. Even if it could, that region controller can't
talk to the DB, so it would be worthless. It cannot talk to either
of the working region controllers.


Jason Hobbs (jason-hobbs) wrote :

On Tue, Feb 6, 2018 at 3:47 PM, Andres Rodriguez
<email address hidden> wrote:
> Hey Jason,
>
> The information you provided is not enough for us to determine the
> configuration that you have. Since you now have a set environment, it
> would be ideal to have a diagram or something similar that shows us how
> you are configuring your MAAS HA environment, as it is difficult to
> understand without a picture of how things are physically connected.
>
> That said, I have a few questions. As I understand it, you have 1 rack
> controller isolated from the *VIP* of the region controller. I guess
> this means that rackd.conf points to the VIP. But:
>
> 1. Can the rack controller connect to the region controllers directly?

The isolated rack controller can't talk to either of the other two
nodes at all. They don't see any traffic from it on their own IPs or
on the VIPs.

> 2. What is the state of the rack controller once it cannot connect to the VIP?

rackd is still running and spewing lots of errors; dhcpd, ntpd, tgtd,
etc. are also still running.

> 3. Is this rack controller a secondary for DHCP HA?

Yes.


Jason Hobbs (jason-hobbs) wrote :

I blocked all IP traffic between the isolated system and the other two
systems, so it couldn't talk via either RPC or HTTP. dhcpd never
stopped on the isolated system, and is still running right now.

On Tue, Feb 6, 2018 at 4:03 PM, Blake Rouse <email address hidden> wrote:
> It is designed to stop DHCPD if the rack controller cannot talk to any
> region controllers. Just because you prevented the rack controller from
> talking to the region over HTTP did you prevent the RPC connections?
> That is a different port.

Changed in maas:
status: Incomplete → New

Jason Hobbs (jason-hobbs) wrote :

To be clear, I simulated the network partition by running "/sbin/iptables -I INPUT -s 10.245.31.4 -j DROP" on 10.245.31.1 and 10.245.31.3.
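
For anyone repeating the test, the rule above and its removal can be generated from one small helper. This is a sketch only: `partition_commands` is a hypothetical name, and the undo form relies on iptables' standard `-D` (delete matching rule) option, which the thread itself does not mention.

```python
def partition_commands(isolated_ip, undo=False):
    """Build the iptables argv that drops (or stops dropping) traffic from isolated_ip.

    With undo=False this reproduces the rule quoted in the comment above;
    with undo=True it deletes that same rule again.
    """
    action = "-D" if undo else "-I"
    return ["/sbin/iptables", action, "INPUT", "-s", isolated_ip, "-j", "DROP"]


# To apply on each healthy controller (requires root):
# import subprocess
# subprocess.run(partition_commands("10.245.31.4"), check=True)
# ...and afterwards:
# subprocess.run(partition_commands("10.245.31.4", undo=True), check=True)
```

Note that the rule only filters packets arriving on the two healthy controllers; nothing is changed on the isolated controller itself.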

Andres Rodriguez (andreserl) wrote :

IMHO, based on the logs from rackd.log it seems that after the rack is unable to connect, there are a lot of tracebacks due to unhandled errors. These unhandled errors could be blocking the code that stops dhcpd.

So, I see various improvements or different bugs. rackd shouldn't traceback on unhandled errors; instead, it should recognize that it cannot do what it needs to do and stop all services. That includes:

1. Logging a message that, because of the lost connection, it cannot update NTP:
2018-02-06 21:15:33 provisioningserver.rackdservices.ntp: [critical] Failed to update NTP configuration.
[...]
twisted.protocols.amp.UnhandledCommand: (b'UNHANDLED', 'Unknown Error [everitt:pid=1933:cmd=GetTimeConfiguration:ask=58f3]')

2. Neighbour discovery should do the same as above (in fact, this issue could be the one that's preventing the rack from stopping dhcpd):

Failure: twisted.protocols.amp.UnhandledCommand: (b'UNHANDLED', 'Unknown Error [everitt:pid=1944:cmd=ReportNeighbours:ask=5976]')
2018-02-06 21:17:01 provisioningserver.rpc.clusterservice: [info] Region not available: User timeout caused connection failure. (While requesting RPC info at b'http://[::ffff:10.245.32.102]/MAAS/rpc/').
2018-02-06 21:17:09 twisted.internet.defer: [critical] Unhandled error in Deferred:
2018-02-06 21:17:09 twisted.internet.defer: [critical]

Traceback (most recent call last):
Failure: twisted.protocols.amp.UnhandledCommand: (b'UNHANDLED', 'Unknown Error [everitt:pid=1946:cmd=ReportNeighbours:ask=5914]')
2018-02-06 21:17:10 twisted.internet.defer: [critical] Unhandled error in Deferred:
2018-02-06 21:17:10 twisted.internet.defer: [critical]

Traceback (most recent call last):
Failure: twisted.protocols.amp.UnhandledCommand: (b'UNHANDLED', 'Unknown Error [everitt:pid=1944:cmd=ReportNeighbours:ask=5977]')
2018-02-06 21:17:10 twisted.internet.defer: [critical] Unhandled error in Deferred:
2018-02-06 21:17:10 twisted.internet.defer: [critical]

Traceback (most recent call last):
Failure: twisted.protocols.amp.UnhandledCommand: (b'UNHANDLED', 'Unknown Error [everitt:pid=1944:cmd=ReportNeighbours:ask=5978]')

3. Same for image download service:

2018-02-06 21:17:33 provisioningserver.rackdservices.image_download_service: [critical] Downloading images failed.
[...]
twisted.protocols.amp.UnhandledCommand: (b'UNHANDLED', 'Unknown Error [everitt:pid=1946:cmd=GetProxies:ask=5916]')
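
The behaviour suggested in the three points above could look roughly like the following. This is a plain-Python sketch, not Twisted and not MAAS's actual code: every name in it (`NoRegionConnection`, `run_tasks`, `reconcile_services`) is hypothetical, and it only illustrates the idea that a failing periodic task is logged once and never blocks the step that stops dhcpd.

```python
import logging

log = logging.getLogger("rackd-sketch")


class NoRegionConnection(Exception):
    """Raised by a task when no region controller can be reached (hypothetical)."""


def run_tasks(tasks):
    """Run each (name, callable) task; return the names of tasks that failed.

    A failing task is logged as a single message instead of propagating,
    so one failure cannot prevent the remaining tasks from running.
    """
    failed = []
    for name, task in tasks:
        try:
            task()
        except NoRegionConnection:
            log.warning("%s skipped: no connection to a region controller", name)
            failed.append(name)
        except Exception:
            log.exception("%s failed unexpectedly", name)
            failed.append(name)
    return failed


def reconcile_services(region_connections, stop_service, services=("dhcpd",)):
    """Stop the given services when there are no region connections.

    Returns the list of services that were asked to stop.
    """
    if region_connections > 0:
        return []
    for service in services:
        stop_service(service)
    return list(services)
```

Keeping the stop decision in its own function, driven only by the connection count, means it still runs even when the NTP, neighbour or image tasks all fail.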

Changed in maas:
milestone: none → 2.4.x
milestone: 2.4.x → 2.4.0alpha1
status: New → Triaged
importance: Undecided → High
tags: added: performance
summary: - [2.3, ha] rack controller HA fails during a network partition
+ [2.3, ha] Network partition for HA rack controller doesn't stop services
summary: - [2.3, ha] Network partition for HA rack controller doesn't stop services
+ [2.3, ha] Network partition breaks HA rack controller which doesn't stop
+ services
Changed in maas:
milestone: 2.4.0alpha1 → 2.4.0alpha2
Changed in maas:
milestone: 2.4.0alpha2 → 2.4.0beta1
summary: - [2.3, ha] Network partition breaks HA rack controller which doesn't stop
- services
+ [2.3, ha, b1] Network partition breaks HA rack controller which doesn't
+ stop services
summary: - [2.3, ha, b1] Network partition breaks HA rack controller which doesn't
- stop services
+ [2.4, service-tracking, 2.3] Network partition breaks HA rack controller
+ which doesn't stop services
Changed in maas:
milestone: 2.4.0beta1 → 2.4.0beta2
summary: - [2.4, service-tracking, 2.3] Network partition breaks HA rack controller
+ [2.3, service-tracking] Network partition breaks HA rack controller
which doesn't stop services
Changed in maas:
assignee: nobody → Blake Rouse (blake-rouse)
status: Triaged → In Progress
Changed in maas:
milestone: 2.4.0beta2 → 2.4.0beta3
milestone: 2.4.0beta3 → 2.4.0rc1
Changed in maas:
status: In Progress → Fix Committed
Changed in maas:
milestone: 2.4.0rc1 → 2.4.0beta3
Changed in maas:
status: Fix Committed → Fix Released