MAAS becomes unresponsive with maasserver_notification stuck at concurrent update

Bug #1843268 reported by Dylan Wang
This bug affects 2 people
Affects  Status         Importance  Assigned to  Milestone
MAAS     Fix Released   High        Unassigned

Bug Description

I'm not sure what causes it, but we are running 75 rack controllers, and since a few days ago our MAAS region controller has become unresponsive. Most of the CPU is consumed by postgres, while everything looks normal in regiond.log.

We dug further and found tons of errors like this in the PostgreSQL log:

2019-09-09 21:26:26.683 CST [15196] maas@maasdb ERROR: could not serialize access due to concurrent update
2019-09-09 21:26:26.683 CST [15196] maas@maasdb STATEMENT: UPDATE "maasserver_notification" SET "updated" = '2019-09-09T21:26:26.250962'::timestamp, "message" = '45 rack controllers are not yet connected to the region. Visit the <a href="/MAAS/#/controllers">rack controllers page</a> for more information.' WHERE "maasserver_notification"."id" = 381635
2019-09-09 21:26:26.683 CST [15306] maas@maasdb ERROR: could not serialize access due to concurrent update
2019-09-09 21:26:26.683 CST [15306] maas@maasdb STATEMENT: UPDATE "maasserver_notification" SET "updated" = '2019-09-09T21:26:26.325280'::timestamp, "message" = '45 rack controllers are not yet connected to the region. Visit the <a href="/MAAS/#/controllers">rack controllers page</a> for more information.' WHERE "maasserver_notification"."id" = 381635
2019-09-09 21:26:28.160 CST [15206] maas@maasdb ERROR: could not serialize access due to concurrent update
2019-09-09 21:26:28.160 CST [15206] maas@maasdb STATEMENT: UPDATE "maasserver_notification" SET "updated" = '2019-09-09T21:26:27.700467'::timestamp, "message" = '46 rack controllers are not yet connected to the region. Visit the <a href="/MAAS/#/controllers">rack controllers page</a> for more information.' WHERE "maasserver_notification"."id" = 381635
2019-09-09 21:26:28.160 CST [15296] maas@maasdb ERROR: could not serialize access due to concurrent update

I suspect that we have too many rack controllers connected to different regiond processes, each of which is trying to update this table. All the updates queue up and eventually cause the MAAS server to hang.

This is reproducible every time I restart regiond or add more rack controllers; the only way to recover is to stop some of the running rackd instances.
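
The suspected pile-up described above can be illustrated with a toy model. This is only a sketch, not MAAS or PostgreSQL code: it assumes every writer starts from the same snapshot of the row, that only the first committer succeeds (as happens when PostgreSQL reports "could not serialize access due to concurrent update"), and that every aborted writer retries. The names `Row`, `SerializationFailure`, and `simulate` are hypothetical.

```python
class SerializationFailure(Exception):
    """Stands in for 'could not serialize access due to concurrent update'."""


class Row:
    """A single row with a version counter bumped on every committed update."""

    def __init__(self):
        self.version = 0
        self.message = ""

    def update(self, snapshot_version, message):
        # First committer wins: a writer holding a stale snapshot must abort.
        if snapshot_version != self.version:
            raise SerializationFailure()
        self.version += 1
        self.message = message


def simulate(writers):
    """Return (commits, failures) when `writers` transactions target one row."""
    row = Row()
    pending = list(range(writers))
    commits = failures = 0
    while pending:
        # Assumption: all pending transactions start from the same snapshot.
        snapshot = row.version
        retry = []
        for writer in pending:
            try:
                row.update(snapshot, f"update from writer {writer}")
                commits += 1
            except SerializationFailure:
                # Everyone except the first committer aborts and retries.
                failures += 1
                retry.append(writer)
        pending = retry
    return commits, failures
```

Under these assumptions the number of aborted attempts grows quadratically with the number of writers (n(n-1)/2 for n writers), which would be consistent with the database burning CPU on retries as more rack controllers are added.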

Related branches

Dylan Wang (hyuwang)
description: updated
Changed in maas:
status: New → Triaged
importance: Undecided → High
milestone: none → 2.7.0alpha1
Changed in maas:
milestone: 2.7.0b1 → 2.7.0b2
Changed in maas:
milestone: 2.7.0b2 → none
Dylan Wang (hyuwang) wrote :

Any updates on this issue?
Why is it not in the milestone anymore?

tags: added: bug-council
Adam Collard (adam-collard) wrote :

Let's remove the middleware that checks rack connectivity and instead create a new service that periodically checks whether rack controllers are connected and notifies admins when action is required.
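
A minimal sketch of what such a periodic check could look like, as opposed to doing the work on every request. This is not the actual MAAS implementation: `NotificationStore`, `RackConnectivityCheck`, and `get_connected` are hypothetical names, and the store is an in-memory stand-in for the `maasserver_notification` row. The point is that a single caller runs the check on a timer and writes only when the message actually changes.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional, Set


@dataclass
class NotificationStore:
    """Hypothetical in-memory stand-in for the notification row."""

    message: Optional[str] = None
    writes: int = 0

    def set(self, message: Optional[str]) -> None:
        # Skip the write entirely when nothing changed.
        if message != self.message:
            self.message = message
            self.writes += 1


@dataclass
class RackConnectivityCheck:
    """Meant to be run by one periodic service, not by every request."""

    known_racks: Set[str]
    get_connected: Callable[[], Set[str]]  # hypothetical lookup of connected racks
    store: NotificationStore = field(default_factory=NotificationStore)

    def run_once(self) -> int:
        missing = self.known_racks - self.get_connected()
        if missing:
            self.store.set(
                f"{len(missing)} rack controllers are not yet "
                "connected to the region."
            )
        else:
            # Clear the notification once every rack is connected.
            self.store.set(None)
        return len(missing)
```

`run_once` would be driven by a single scheduler at a fixed interval, so the number of writes no longer scales with request volume or with the number of regiond processes handling requests.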

Changed in maas:
milestone: none → 3.3.0
tags: removed: bug-council
Jerzy Husakowski (jhusakowski) wrote :

Let's implement Adam's suggestion.

Adam Collard (adam-collard) wrote :

The linked branch, which should land in 3.3 shortly, removes the (arguably insane) middleware check that verifies on every request that rack controllers are connected.

Status for rack controller connectivity and liveness is already modelled in the Controllers section of the UI.

Changed in maas:
status: Triaged → Fix Committed
Changed in maas:
milestone: 3.3.0 → 3.3.0-beta3
Changed in maas:
status: Fix Committed → Fix Released