MAAS

Machine listing got messed up

Bug #1881275 reported by Björn Tillenius on 2020-05-29

Affects		Status	Importance	Assigned to	Milestone
	MAAS	Fix Released	High	Björn Tillenius	MAAS 2.8.0rc3
	maas-ui	Fix Released	Unknown	maas-ui-bugs #1211

Bug Description

This is with MAAS 2.8.0~beta4 (8510-g.f5a42ccea)

I'm not sure what happened, but the machine listing got in a weird state:

https://private-fileshare.canonical.com/~bjorn/too-many-machines.png

There should be 19 machines and 2 resource pools, but it instead showed
57 machines and 4 resource pools.

This happened on two browser tabs that both had the machine listing open.
I'm not sure what triggered it. But I see that the MAAS snap was refreshed
and thus caused a restart to happen, so it might be related to that.

In the logs I see that MAAS was updated from revision 6683 to 6723.

After debugging, it seems that the UI is not refreshing its state when it loses connection to the server and reconnects again.

Tags: ui

Björn Tillenius (bjornt) wrote on 2020-05-29:

I did try to simply restart the MAAS server, and that didn't trigger any problems with the listing. I wonder if there were any UI changes in the snap refresh that caused it?

tags:	added: ui
Changed in maas:
milestone:	none → 2.8.0rc1

Björn Tillenius (bjornt) wrote on 2020-05-29:

Note that this could be a backend bug. I do see some 'machine create' websocket notifications, but I'm not sure what's causing them.

Björn Tillenius (bjornt) wrote on 2020-05-29:

I'm marking this as Incomplete for now, since we don't know whether it's a UI or a backend bug. At the moment I think it's a backend bug, but I'm still debugging.

Changed in maas:
status:	New → Incomplete
importance:	Undecided → High

Björn Tillenius (bjornt) wrote on 2020-06-01:

It seems like the error is related to MAAS starting up. I tried keeping a tab open against a MAAS that had already been running for a long time, and I couldn't reproduce the error.

But after leaving the tab open on the machine listing and then restarting MAAS, I see the error happening within a minute after the listing is showing again.

I don't see anything obvious in the logs, but this looks odd:

2020-06-01 10:40:41 stderr: [error] request to http://127.0.0.1:5240/MAAS/metadata/2012-03-01/ failed. sleeping 4.: HTTP Error 503: Service Unavailable

It keeps trying, but never succeeds.

Björn Tillenius (bjornt) on 2020-06-01

Changed in maas:
assignee:	nobody → Björn Tillenius (bjornt)
tags:	removed: ui

Björn Tillenius (bjornt) wrote on 2020-06-01:

Another datapoint. MAAS failed to query the BMC for three machines, pidgey, natasha, and opelt:

2020-06-01T13:56:47.082420+00:00 jenkins-slave-2 maas.power: [error] pidgey: Power state could not be queried: Incorrect username. Check BMC configuration and try again.
2020-06-01T13:56:47.117859+00:00 jenkins-slave-2 maas.power: [error] pidgey: Could not query power state: Incorrect username. Check BMC configuration and try again..
2020-06-01T13:56:47.134902+00:00 jenkins-slave-2 maas.power: [error] natasha: Power state could not be queried: Incorrect password. Check BMC configuration and try again.
2020-06-01T13:56:47.176709+00:00 jenkins-slave-2 maas.power: [error] natasha: Could not query power state: Incorrect password. Check BMC configuration and try again..
2020-06-01T13:56:47.190883+00:00 jenkins-slave-2 maas.power: [error] opelt: Power state could not be queried: Incorrect password. Check BMC configuration and try again.
2020-06-01T13:56:47.271531+00:00 jenkins-slave-2 maas.power: [error] opelt: Could not query power state: Incorrect password. Check BMC configuration and try again..

At the same time, there's one 'machine create' websocket notification for each of those machines.

I see more of those errors in rackd.log, but so far there's only been one 'machine create' per machine.

Alberto Donato (ack) on 2020-06-04

Changed in maas:
milestone:	2.8.0rc1 → 2.8.0

Björn Tillenius (bjornt) wrote on 2020-06-05:

I'm quite sure it's related to the power checks, but I still haven't found the root cause.

What I've found so far is that when a power check fails, the rack issues an RPC call to the region to create a POWER_QUERY_FAILED event. If I comment out that RPC call, I don't see any issues anymore.

I haven't been able to reproduce this in a unit test yet, but I can reproduce it locally. I've confirmed that the database triggers seem ok, since if I listen to all the machine notifications, I only see a 'machine_update' notification.

So the problem should be somewhere in the websocket code.

Changed in maas:
status:	Incomplete → In Progress

Björn Tillenius (bjornt) wrote on 2020-06-05:

Ok, after further investigation, I think this is partly a UI issue. At least, the UI made this regression visible.

I suspect that the UI no longer reloads the listing after a disconnect and reconnect. The UI needs to refresh its state every time it gets a successful connection to the websocket, otherwise it might be in a bad state.
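To illustrate the point about refreshing on every successful connection, here is a minimal sketch in Python. It is not the maas-ui code: `MachineStore`, `fetch_all`, `on_connected` and `on_create` are hypothetical names, and the store is assumed to append blindly whenever it receives a 'create' notification.

```python
class MachineStore:
    """Hypothetical client-side store; not the maas-ui implementation."""

    def __init__(self, fetch_all):
        # fetch_all is an assumed callable returning the current machine IDs.
        self.fetch_all = fetch_all
        self.items = []

    def on_connected(self):
        # Called on every successful websocket connection, including
        # reconnects: discard the old state and reload the full listing.
        self.items = list(self.fetch_all())

    def on_create(self, system_id):
        # Naive append: duplicates pile up if the server announces a
        # machine that the store already holds.
        self.items.append(system_id)


if __name__ == "__main__":
    server_machines = [f"machine-{i}" for i in range(19)]
    store = MachineStore(lambda: server_machines)

    store.on_connected()
    store.on_create("machine-0")  # spurious 'create' for a known machine
    print(len(store.items))       # listing is now inflated

    store.on_connected()          # reconnect resets the listing
    print(len(store.items))
```

In this sketch a reconnect always returns the store to the server's view, so any inflated listing is corrected as soon as the connection is re-established.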

But we also need to rethink how we do the caching in the websocket handler. It replaces the cache every time it gets a list() call, which means that if you get the machine list in batches, we only keep track of the last batch.
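The caching problem can be sketched as follows. This is a simplified model, not the MAAS handler code: the class and method names are hypothetical. It also assumes that an update for an object missing from the cache is forwarded to the client as a 'create', which would match the 'machine create' notifications seen above but has not been confirmed here.

```python
class ReplacingHandler:
    """Hypothetical handler that overwrites its cache on every list() call."""

    def __init__(self, machines):
        self.machines = machines   # all system IDs, in listing order
        self.loaded_ids = set()    # IDs the handler believes the client holds

    def list(self, start, limit):
        batch = self.machines[start:start + limit]
        # Replaces the cache, so IDs from earlier batches are forgotten.
        self.loaded_ids = set(batch)
        return batch

    def on_update(self, system_id):
        # Assumption: an update for an untracked ID is sent as 'create'.
        return "update" if system_id in self.loaded_ids else "create"


class AccumulatingHandler(ReplacingHandler):
    """Same handler, but the cache grows with each batch."""

    def list(self, start, limit):
        batch = self.machines[start:start + limit]
        self.loaded_ids |= set(batch)
        return batch


def load_in_batches(handler, total, batch_size):
    """Fetch the whole listing the way a batching client would."""
    for start in range(0, total, batch_size):
        handler.list(start, batch_size)


if __name__ == "__main__":
    machines = [f"machine-{i}" for i in range(19)]
    for cls in (ReplacingHandler, AccumulatingHandler):
        handler = cls(machines)
        load_in_batches(handler, len(machines), 10)
        print(cls.__name__, handler.on_update("machine-0"))
```

With 19 machines loaded in batches of 10, the replacing handler only remembers the second batch, so an update to a machine from the first batch is treated as new. The accumulating variant keeps every batch and reports the same change as an update.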

tags:	added: ui

Björn Tillenius (bjornt) on 2020-06-05

description:	updated

Caleb Ellis (caleb-ellis) on 2020-06-11

Changed in maas-ui:
importance:	Undecided → Unknown
status:	New → Unknown

Caleb Ellis (caleb-ellis) on 2020-06-11

Changed in maas:
status:	In Progress → Fix Committed

Bug Watch Updater (bug-watch-updater) on 2020-06-11

Changed in maas-ui:
status:	Unknown → Fix Released

Alberto Donato (ack) on 2020-06-11

Changed in maas:
status:	Fix Committed → Fix Released
