I'm not sure what caused it, but we are running 75 rack controllers, and a few days ago our MAAS region controller became unresponsive. Most of the CPU is consumed by postgres, while everything looks normal in regiond.log.
We dug further and found tons of error logs in pg like this:
2019-09-09 21:26:26.683 CST [15196] maas@maasdb ERROR: could not serialize access due to concurrent update
2019-09-09 21:26:26.683 CST [15196] maas@maasdb STATEMENT: UPDATE "maasserver_notification" SET "updated" = '2019-09-09T21:26:26.250962'::timestamp, "message" = '45 rack controllers are not yet connected to the region. Visit the <a href="/MAAS/#/controllers">rack controllers page</a> for more information.' WHERE "maasserver_notification"."id" = 381635
2019-09-09 21:26:26.683 CST [15306] maas@maasdb ERROR: could not serialize access due to concurrent update
2019-09-09 21:26:26.683 CST [15306] maas@maasdb STATEMENT: UPDATE "maasserver_notification" SET "updated" = '2019-09-09T21:26:26.325280'::timestamp, "message" = '45 rack controllers are not yet connected to the region. Visit the <a href="/MAAS/#/controllers">rack controllers page</a> for more information.' WHERE "maasserver_notification"."id" = 381635
2019-09-09 21:26:28.160 CST [15206] maas@maasdb ERROR: could not serialize access due to concurrent update
2019-09-09 21:26:28.160 CST [15206] maas@maasdb STATEMENT: UPDATE "maasserver_notification" SET "updated" = '2019-09-09T21:26:27.700467'::timestamp, "message" = '46 rack controllers are not yet connected to the region. Visit the <a href="/MAAS/#/controllers">rack controllers page</a> for more information.' WHERE "maasserver_notification"."id" = 381635
2019-09-09 21:26:28.160 CST [15296] maas@maasdb ERROR: could not serialize access due to concurrent update
I suspect that we have too many rack controllers connected to different regiond processes. Each of them is trying to update this same table row, so the updates queue up and keep failing with serialization errors, which eventually causes the MAAS server to hang.
This is reproducible every time I restart regiond or add more rack controllers; the only way to recover is to stop some of the running rackd processes.
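For context on the error itself: PostgreSQL raises "could not serialize access due to concurrent update" when a transaction at REPEATABLE READ or SERIALIZABLE isolation tries to UPDATE a row that another transaction has already modified since its snapshot was taken; the losing transaction must be rolled back and retried from the start. With many regiond workers all rewriting notification id 381635, most attempts lose the race and retry, which matches the CPU burn we see. A minimal sketch of that lose-and-retry pattern (a pure-Python simulation only, no real database, MAAS code, or actual PostgreSQL semantics involved):

```python
import threading

class SerializationFailure(Exception):
    """Stand-in for PostgreSQL's 'could not serialize access' error."""

class Row:
    """One row guarded by a version counter (optimistic concurrency),
    loosely mimicking how REPEATABLE READ detects a concurrent update."""
    def __init__(self):
        self.version = 0
        self.message = ""
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            return self.version, self.message

    def update(self, seen_version, message):
        with self._lock:
            if self.version != seen_version:
                # Someone else committed first: serialization failure.
                raise SerializationFailure
            self.version += 1
            self.message = message

def update_with_retry(row, message, attempts):
    """Retry the whole 'transaction' until it wins the race."""
    while True:
        attempts.append(1)       # count every attempt, wins and losses
        version, _ = row.read()  # take a fresh snapshot
        try:
            row.update(version, message)
            return
        except SerializationFailure:
            continue             # roll back and start over

row = Row()
attempts = []
workers = [
    threading.Thread(target=update_with_retry,
                     args=(row, "%d rack controllers are not yet connected" % n,
                           attempts))
    for n in range(20)
]
for t in workers:
    t.start()
for t in workers:
    t.join()

print(row.version)  # 20: each worker eventually commits exactly once
```

Every worker does commit in the end, but the number of attempts can far exceed the number of commits, and each failed attempt is wasted work. This is why more rack controllers make the contention on that one row worse rather than just slower.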
Any updates on this issue?
Why is it not in the milestone anymore?