Bug #1747764 “[2.3, service-tracking] Network partition breaks H...” : Bugs : MAAS

Jason Hobbs (jason-hobbs) wrote on 2018-02-06:

#1

logs-2018-02-06-21.34.56.tar (3.3 MiB, application/x-tar)

Andres Rodriguez (andreserl) on 2018-02-06

Changed in maas:
status:	New → Incomplete

Andres Rodriguez (andreserl) wrote on 2018-02-06:

#2

Hey Jason,

The information you provided is not enough for us to determine your configuration. Since you now have a set environment, it would be ideal to have a diagram or something similar that shows us how your MAAS HA environment is configured, as it is difficult to understand without a picture of how things are physically connected.

That said, I have a few questions. As I understand it, you have 1 rack controller isolated from the *VIP* of the region controller. I guess this means that rackd.conf points to the VIP. But:

1. Can the rack controller connect to the region controllers directly?
2. What is the state of the rack controller once it cannot connect to the VIP?
3. Is this rack controller a secondary for DHCP HA?

Andres Rodriguez (andreserl) wrote on 2018-02-06:

#3

4. How long was it from the time the network partition was confirmed to the time the machine attempted to boot?
5. After the machine failed to boot, did the rack controller continue providing DHCP?
6. Did the rack controller fully disconnect from *all* regions?

Andres Rodriguez (andreserl) wrote on 2018-02-06:

#4

I see in the logs this:

2018-02-06 21:14:26 provisioningserver.rpc.clusterservice: [info] Region not available: User timeout caused connection failure. (While requesting RPC info at b'http://[::ffff:10.245.32.102]/MAAS/rpc/').
2018-02-06 21:14:57 provisioningserver.rpc.clusterservice: [info] Region not available: User timeout caused connection failure. (While requesting RPC info at b'http://[::ffff:10.245.32.102]/MAAS/rpc/').
2018-02-06 21:15:25 twisted.internet.defer: [critical] Unhandled error in Deferred:
2018-02-06 21:15:25 twisted.internet.defer: [critical]

Based on your comment, that should be the time you started the partition. So, a few questions about what happened when the rack disconnected:

- It seems that dhcpd continued running. Did this change at all afterwards, or did it just never stop?
- /var/lib/maas/dhcpd.conf seems not to have been "deleted" or updated to reflect the lost connection. Did it stay the same over time? Did it ever get removed?

summary:

- rack controller HA fails during a network partition
+ [2.3, ha] rack controller HA fails during a network partition

Blake Rouse (blake-rouse) wrote on 2018-02-06: Re: [2.3, ha] rack controller HA fails during a network partition

#5

It is designed to stop DHCPD if the rack controller cannot talk to any region controllers. You prevented the rack controller from talking to the region over HTTP, but did you also prevent the RPC connections? Those use a different port.
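To make the HTTP-versus-RPC distinction concrete, here is a minimal sketch of a reachability check that could be run from the isolated rack controller. It only tests whether a TCP connection can be opened; it does not speak the MAAS RPC protocol. The function names are hypothetical, and the port numbers you pass in are assumptions to be checked against your own deployment (the log above shows the RPC info URL being fetched over plain HTTP with no explicit port).

```python
import socket
from typing import Dict, Iterable


def port_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False


def check_region(host: str, ports: Iterable[int],
                 timeout: float = 2.0) -> Dict[int, bool]:
    """Check several ports on one region controller (hypothetical helper)."""
    return {port: port_reachable(host, port, timeout) for port in ports}
```

For example, `check_region("10.245.32.102", [80, 5250])` would report each port separately, where 5250 is only an assumed RPC port. If every entry is `False`, the rack is cut off from that region on both paths.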

Jason Hobbs (jason-hobbs) wrote on 2018-02-06: Re: [Bug 1747764] Re: rack controller HA fails during a network partition

#6

ha setup 1747764.png (24.0 KiB, image/png; name="ha setup 1747764.png")

I've attached a drawing of my setup, I hope that helps. 10.245.31.4
(everitt) is the isolated controller in this test.

On Tue, Feb 6, 2018 at 3:53 PM, Andres Rodriguez
<email address hidden> wrote:
> 4. How long was it from the time the network partition was confirmed to the time the machine attempted to boot?

The first time through, what's in the logs, I started the deploy about
a minute after the rack controller lost its connection to the region
controllers.

It continued to try to boot and fail for about 20 minutes, at which
point I released the node. 45 minutes later now, I tried to deploy
the node again and I'm getting the same behavior - the isolated rack
controller is still providing dhcp.

> 5. After the machine failed to boot, did the rack controller continued providing DHCP ?

Yes.

> 6. Did the rack controller at all fully disconnected for *all* regions?

A region controller runs on the same node, and the rack controller may
still be able to talk to it, but I don't think it's connected to it,
because it can't reach the region's VIP to fetch /rpc and find out
where it should connect. Even if it could, that region controller
can't talk to the DB, so it would be worthless. The rack controller
cannot talk to either of the working region controllers.

> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1747764
>
> Title:
> [2.3, ha] rack controller HA fails during a network partition
>
> Status in MAAS:
> Incomplete
>
> Bug description:
> I have an HA setup with 3 MAAS controllers, each running rack
> controllers and region controllers.
>
> On two of the three controllers, I used iptables to drop traffic from
> the third, to simulate a network partition.
>
> Then I instructed MAAS to deploy a node. The node powered on fine,
> but when it started PXE booting, the third isolated rack controller
> responded to the DHCP request, gave it an IP, and told it to talk to
> it via tftp to get its pxelinux.cfg.
>
> That rack controller was unable to provide the pxelinux.cfg because it
> couldn't reach the region controller via the VIP due to the network
> partition, and the node failed to PXE boot.
>
> I think that the isolated rack controller should not be running DHCP.
> If a rack controller can't reach the region controller, it can't
> handle PXE booting a node, and shouldn't try. If it would not have
> responded, one of the functional rack controllers would have and it
> would be fine.
>
> In the attached logs, 10.245.31.4 is the node that was isolated. I
> started the isolation at about 21:15.
>
> This is with 2.3.0-6434-gd354690-0ubuntu1~16.04.1.
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/maas/+bug/1747764/+subscriptions

Jason Hobbs (jason-hobbs) wrote on 2018-02-06:

#7

On Tue, Feb 6, 2018 at 3:47 PM, Andres Rodriguez
<email address hidden> wrote:
> Hey Jason,
>
> The information you provide is not enough for us to determine the
> configuration that you have. Since you now have a set environment, it
> would be ideal to have a graph or something that show us how you are
> configuring your MAAS HA environment, as it is difficult to understand
> without having a picture of how things are physically connected.
>
> That said. I have a few questions. As I understand, you have 1 rack
> controller isolated from the *VIP* of the region controller. I guess
> this means that rackd.conf points to the VIP. But:
>
> 1. Can the rack controller connect to the region controllers directly ?

The isolated rack controller can't talk to either of the other two
nodes at all. They don't see any traffic from it on their own IPs or
on the VIPs.

> 2. What is the state of the rack controller once it cannot connect to the VIP?

rackd is still running and spewing lots of errors, dhcpd is still
running, ntpd, tgtd, etc.

> 3. Is this rack controller a secondary for DHCP HA?

Yes.

Jason Hobbs (jason-hobbs) wrote on 2018-02-06: Re: [Bug 1747764] Re: [2.3, ha] rack controller HA fails during a network partition

#8

I blocked all IP traffic between the isolated system and the other two
systems, so it couldn't talk either via RPC or HTTP. dhcpd never
stopped on the isolated system, and is still running right now.

On Tue, Feb 6, 2018 at 4:03 PM, Blake Rouse <email address hidden> wrote:
> It is designed to stop DHCPD if the rack controller cannot talk to any
> region controllers. Just because you prevented the rack controller from
> talking to the region over HTTP did you prevent the RPC connections?
> That is a different port.

Changed in maas:
status:	Incomplete → New

Jason Hobbs (jason-hobbs) wrote on 2018-02-06: Re: [2.3, ha] rack controller HA fails during a network partition

#9

To be clear, to simulate the network partition by blocking traffic, I ran "/sbin/iptables -I INPUT -s 10.245.31.4 -j DROP" on 10.245.31.1 and 10.245.31.3.
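For anyone reproducing this, a small sketch that builds the same rule as an argument list, plus the matching `-D` form to lift the partition afterwards. The helper name `partition_rule` is hypothetical; the sketch only constructs the command and does not run it.

```python
import ipaddress
from typing import List


def partition_rule(isolated_ip: str, action: str = "insert") -> List[str]:
    """Build the iptables command that drops (or stops dropping) traffic
    arriving from isolated_ip. Run the result on each healthy controller."""
    # Raises ValueError on a malformed address before anything is executed.
    ipaddress.ip_address(isolated_ip)
    flag = {"insert": "-I", "delete": "-D"}[action]
    return ["/sbin/iptables", flag, "INPUT", "-s", isolated_ip, "-j", "DROP"]
```

The list can be passed to `subprocess.run(..., check=True)` as root. Note that the rule filters only packets arriving from the isolated host; replies never get through, so TCP connections in either direction fail.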

Andres Rodriguez (andreserl) wrote on 2018-02-06:

#10

IMHO, based on rackd.log, it seems that after the rack is unable to connect there are a lot of tracebacks due to unhandled errors. These unhandled errors could be blocking the code that stops dhcpd.

So I see several possible improvements, or separate bugs. rackd shouldn't traceback on unhandled errors; instead, it should recognize that it cannot do what it needs to do and stop all services. That includes:

1. Logging a message that it cannot update NTP because of the lost connection:
2018-02-06 21:15:33 provisioningserver.rackdservices.ntp: [critical] Failed to update NTP configuration.
[...]
twisted.protocols.amp.UnhandledCommand: (b'UNHANDLED', 'Unknown Error [everitt:pid=1933:cmd=GetTimeConfiguration:ask=58f3]')

2. Neighbour discovery should do the same as above (in fact, this could be the issue that is preventing the rack from stopping dhcpd):

Failure: twisted.protocols.amp.UnhandledCommand: (b'UNHANDLED', 'Unknown Error [everitt:pid=1944:cmd=ReportNeighbours:ask=5976]')
2018-02-06 21:17:01 provisioningserver.rpc.clusterservice: [info] Region not available: User timeout caused connection failure. (While requesting RPC info at b'http://[::ffff:10.245.32.102]/MAAS/rpc/').
2018-02-06 21:17:09 twisted.internet.defer: [critical] Unhandled error in Deferred:
2018-02-06 21:17:09 twisted.internet.defer: [critical]

Traceback (most recent call last):
Failure: twisted.protocols.amp.UnhandledCommand: (b'UNHANDLED', 'Unknown Error [everitt:pid=1946:cmd=ReportNeighbours:ask=5914]')
2018-02-06 21:17:10 twisted.internet.defer: [critical] Unhandled error in Deferred:
2018-02-06 21:17:10 twisted.internet.defer: [critical]

Traceback (most recent call last):
Failure: twisted.protocols.amp.UnhandledCommand: (b'UNHANDLED', 'Unknown Error [everitt:pid=1944:cmd=ReportNeighbours:ask=5977]')
2018-02-06 21:17:10 twisted.internet.defer: [critical] Unhandled error in Deferred:
2018-02-06 21:17:10 twisted.internet.defer: [critical]

Traceback (most recent call last):
Failure: twisted.protocols.amp.UnhandledCommand: (b'UNHANDLED', 'Unknown Error [everitt:pid=1944:cmd=ReportNeighbours:ask=5978]')

3. Same for image download service:

2018-02-06 21:17:33 provisioningserver.rackdservices.image_download_service: [critical] Downloading images failed.
[...]
twisted.protocols.amp.UnhandledCommand: (b'UNHANDLED', 'Unknown Error [everitt:pid=1946:cmd=GetProxies:ask=5916]')
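The behaviour proposed in the three points above can be sketched roughly as follows. This is plain synchronous Python for illustration only, not rackd's Twisted code; `RegionUnavailable`, `run_periodic_tasks` and `reconcile_dhcp` are hypothetical names. The idea is that a failure in one periodic service is logged and contained, so the check that stops dhcpd always gets to run.

```python
import logging
from typing import Callable, Dict, List

log = logging.getLogger("rackd-sketch")


class RegionUnavailable(Exception):
    """Hypothetical error: an RPC call could not reach any region."""


def run_periodic_tasks(tasks: Dict[str, Callable[[], None]]) -> List[str]:
    """Run every task, containing failures. Returns names of failed tasks."""
    failed = []
    for name, task in tasks.items():
        try:
            task()
        except RegionUnavailable as exc:
            # Expected during a partition: one concise line, no traceback.
            log.warning("%s skipped: %s", name, exc)
            failed.append(name)
        except Exception:
            # Unexpected: keep the traceback but continue with other tasks.
            log.exception("%s failed unexpectedly", name)
            failed.append(name)
    return failed


def reconcile_dhcp(connected_regions: int,
                   stop_dhcpd: Callable[[], None]) -> bool:
    """Stop dhcpd when no region is connected. Returns True if stopped."""
    if connected_regions == 0:
        stop_dhcpd()
        return True
    return False
```

With this shape, calling `run_periodic_tasks(...)` followed by `reconcile_dhcp(0, stop_dhcpd)` still stops dhcpd even when the NTP, neighbour discovery and image download tasks all fail.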