I'm not sure what caused it, but we are running 75 rack controllers, and a few days ago our MAAS region controller became unresponsive. Most of the CPU is consumed by postgres, while everything looks normal in regiond.log.
We dug further and found tons of error logs in pg like this:
2019-09-09 21:26:26.683 CST [15196] maas@maasdb ERROR: could not serialize access due to concurrent update
2019-09-09 21:26:26.683 CST [15196] maas@maasdb STATEMENT: UPDATE "maasserver_notification" SET "updated" = '2019-09-09T21:26:26.250962'::timestamp, "message" = '45 rack controllers are not yet connected to the region. Visit the <a href="/MAAS/#/controllers">rack controllers page</a> for more information.' WHERE "maasserver_notification"."id" = 381635
2019-09-09 21:26:26.683 CST [15306] maas@maasdb ERROR: could not serialize access due to concurrent update
2019-09-09 21:26:26.683 CST [15306] maas@maasdb STATEMENT: UPDATE "maasserver_notification" SET "updated" = '2019-09-09T21:26:26.325280'::timestamp, "message" = '45 rack controllers are not yet connected to the region. Visit the <a href="/MAAS/#/controllers">rack controllers page</a> for more information.' WHERE "maasserver_notification"."id" = 381635
2019-09-09 21:26:28.160 CST [15206] maas@maasdb ERROR: could not serialize access due to concurrent update
2019-09-09 21:26:28.160 CST [15206] maas@maasdb STATEMENT: UPDATE "maasserver_notification" SET "updated" = '2019-09-09T21:26:27.700467'::timestamp, "message" = '46 rack controllers are not yet connected to the region. Visit the <a href="/MAAS/#/controllers">rack controllers page</a> for more information.' WHERE "maasserver_notification"."id" = 381635
2019-09-09 21:26:28.160 CST [15296] maas@maasdb ERROR: could not serialize access due to concurrent update
I suspect that we have too many rack controllers connected to different regiond processes. Each of them is trying to update this same table row, so the updates queue up and keep failing with serialization errors, which eventually causes the MAAS server to hang.
This is reproducible every time I restart regiond or add more rack controllers; the only way to recover is to stop some of the running rackd processes.
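For context on the error itself: PostgreSQL raises "could not serialize access due to concurrent update" when a transaction at REPEATABLE READ or SERIALIZABLE isolation tries to UPDATE a row that another transaction has already modified since its snapshot was taken; the losing transaction must be rolled back and retried from the start. With many regiond workers all rewriting notification id 381635, most attempts lose the race and retry, which matches the CPU burn we see. A minimal sketch of that lose-and-retry pattern (a pure-Python simulation only, no real database, MAAS code, or actual PostgreSQL semantics involved):

```python
import threading

class SerializationFailure(Exception):
    """Stand-in for PostgreSQL's 'could not serialize access' error."""

class Row:
    """One row guarded by a version counter (optimistic concurrency),
    loosely mimicking how REPEATABLE READ detects a concurrent update."""
    def __init__(self):
        self.version = 0
        self.message = ""
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            return self.version, self.message

    def update(self, seen_version, message):
        with self._lock:
            if self.version != seen_version:
                # Someone else committed first: serialization failure.
                raise SerializationFailure
            self.version += 1
            self.message = message

def update_with_retry(row, message, attempts):
    """Retry the whole 'transaction' until it wins the race."""
    while True:
        attempts.append(1)       # count every attempt, wins and losses
        version, _ = row.read()  # take a fresh snapshot
        try:
            row.update(version, message)
            return
        except SerializationFailure:
            continue             # roll back and start over

row = Row()
attempts = []
workers = [
    threading.Thread(target=update_with_retry,
                     args=(row, "%d rack controllers are not yet connected" % n,
                           attempts))
    for n in range(20)
]
for t in workers:
    t.start()
for t in workers:
    t.join()

print(row.version)  # 20: each worker eventually commits exactly once
```

Every worker does commit in the end, but the number of attempts can far exceed the number of commits, and each failed attempt is wasted work. This is why more rack controllers make the contention on that one row worse rather than just slower.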
Any updates on this issue?
Why is it not in the milestone anymore?