[2.4] MAAS UI becomes unresponsive (slow) when there's disconnected controllers/RPC errors

Bug #1762461 reported by Andres Rodriguez
This bug affects 3 people
Affects    Status         Importance   Assigned to   Milestone
MAAS       Fix Released   Critical     Lee Trager
2.4        In Progress    Critical     Lee Trager

Bug Description

After I upgraded my primary region/rack, MAAS became very unresponsive, and I saw the following errors in the logs.

This seems like a new recurrence of this old bug: https://bugs.launchpad.net/maas/+bug/1360004

2018-04-09 15:39:05 maasserver: [error] Error while calling DescribePowerTypes: RPC connection timed out to rack controller 'maas00' (6g7yrg).
2018-04-09 15:39:06 maasserver: [error] Error while calling DescribePowerTypes: Unable to get RPC connection for rack controller 'node01' (4rcbwp).
2018-04-09 15:39:10 maasserver: [error] Error while calling DescribePowerTypes: RPC connection timed out to rack controller 'maas00' (6g7yrg).
2018-04-09 15:39:10 maasserver: [error] Error while calling DescribePowerTypes: Unable to get RPC connection for rack controller 'node01' (4rcbwp).
2018-04-09 15:39:16 maasserver: [error] Error while calling DescribePowerTypes: RPC connection timed out to rack controller 'maas00' (6g7yrg).
2018-04-09 15:39:16 maasserver: [error] Error while calling DescribePowerTypes: Unable to get RPC connection for rack controller 'node01' (4rcbwp).
2018-04-09 15:39:20 maasserver: [error] Error while calling DescribePowerTypes: RPC connection timed out to rack controller 'maas00' (6g7yrg).
2018-04-09 15:39:20 maasserver: [error] Error while calling DescribePowerTypes: Unable to get RPC connection for rack controller 'node01' (4rcbwp).
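To see at a glance which rack controllers are producing these errors, the log lines above can be tallied per controller. The following is only a minimal sketch: the regular expression is derived solely from the lines quoted in this report, and the function name summarize_rpc_errors is hypothetical, not part of MAAS. Other MAAS log formats may need a different pattern.

```python
import re
from collections import Counter

# Pattern inferred from the log lines quoted in this bug report only;
# it is an assumption, not an official MAAS log format specification.
LINE_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) maasserver: \[error\] "
    r"Error while calling (?P<call>\w+): (?P<reason>.+?) "
    r"(?:to|for) rack controller '(?P<name>[^']+)' \((?P<system_id>\w+)\)\.$"
)


def summarize_rpc_errors(lines):
    """Count RPC errors per (controller name, system id, reason).

    Hypothetical helper for illustration; lines that do not match the
    pattern above are silently skipped.
    """
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            key = (match["name"], match["system_id"], match["reason"])
            counts[key] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "2018-04-09 15:39:05 maasserver: [error] Error while calling "
        "DescribePowerTypes: RPC connection timed out to rack controller "
        "'maas00' (6g7yrg).",
        "2018-04-09 15:39:06 maasserver: [error] Error while calling "
        "DescribePowerTypes: Unable to get RPC connection for rack "
        "controller 'node01' (4rcbwp).",
    ]
    for (name, system_id, reason), count in summarize_rpc_errors(sample).items():
        print(f"{name} ({system_id}): {reason} x{count}")
```

A summary like this only shows which controllers are failing and how often; it does not explain why the UI slows down.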

Andres Rodriguez (andreserl) wrote :
description: updated
Changed in maas:
milestone: none → 2.4.0beta2
importance: Undecided → High
status: New → Triaged
summary: - [2.4] MAAS becomes unresponsive when there's RPC errors
+ [2.4] MAAS UI becomes unresponsive when there's RPC errors
Changed in maas:
importance: High → Critical
assignee: nobody → Lee Trager (ltrager)
Andres Rodriguez (andreserl) wrote : Re: [2.4] MAAS UI becomes unresponsive when there's RPC errors
Andres Rodriguez (andreserl) wrote :

Please see 1762461 for more details too.

Changed in maas:
milestone: 2.4.0beta2 → 2.4.0rc1
summary: - [2.4] MAAS UI becomes unresponsive when there's RPC errors
+ [2.4] MAAS UI becomes unresponsive (slow) when there's disconnected
+ controllers/RPC errors
Changed in maas:
assignee: Lee Trager (ltrager) → Blake Rouse (blake-rouse)
Changed in maas:
assignee: Blake Rouse (blake-rouse) → nobody
status: Triaged → Incomplete
Changed in maas:
milestone: 2.4.0rc1 → 2.4.0rc2
Mark Shuttleworth (sabdfl) wrote :

Not sure where we are tracking this, but I see it in the GMAAS too. I think I have one sick rack controller. Might it have to do with the IP addresses on which RPC services listen?

Mark Shuttleworth (sabdfl) wrote :

I see this:

2018-05-06 12:47:16 maasserver: [error] Error while calling DescribePowerTypes: RPC connection timed out to rack controller 'lapsi' (4nqs36).
2018-05-06 12:47:23 -: [critical] Unhandled error in EventualResult
 Traceback (most recent call last):
 Failure: twisted.internet.defer.CancelledError:

2018-05-06 12:47:26 maasserver: [error] Error while calling DescribePowerTypes: RPC connection timed out to rack controller 'lapsi' (4nqs36).

Changed in maas:
milestone: 2.4.0rc2 → 2.4.0rc1
status: Incomplete → Confirmed
assignee: nobody → Lee Trager (ltrager)
Changed in maas:
status: Confirmed → In Progress
milestone: 2.4.0rc1 → 2.4.0rc2
Changed in maas:
milestone: 2.4.0rc2 → 2.5.0
no longer affects: maas/trunk
Changed in maas:
milestone: 2.4.1 → 2.5.0
Lee Trager (ltrager)
Changed in maas:
status: In Progress → Fix Committed
Changed in maas:
milestone: 2.5.0 → 2.5.0alpha1
Changed in maas:
status: Fix Committed → Fix Released
Phil Merricks (seffyroff) wrote :

I'm getting a similar experience on 2.5.0 final here. Three rack controllers are offline, and I've been waiting 45 minutes to bounce the region service. This is my second attempt; the first time, the box eventually ground to a complete halt.

Lee Trager (ltrager) wrote :

Hi Phil,

Could you please open a new bug and attach your MAAS logs?

Phil Merricks (seffyroff) wrote :

I will do that next time I'm running a deployment. FYI I resolved this by:

1: Restoring the racks and region to their prior state (before I shut down the racks to migrate them)
2: Deleting the racks from the region controller via the Controllers UI

With that, the Region Controller played nice and allowed me to migrate it very smoothly. Hurrah!
