Machine listing got messed up

Bug #1881275 reported by Björn Tillenius
Affects   Status        Importance  Assigned to      Milestone
MAAS      Fix Released  High        Björn Tillenius
maas-ui   Fix Released  Unknown

Bug Description

This is with MAAS 2.8.0~beta4 (8510-g.f5a42ccea)

I'm not sure what happened, but the machine listing got in a weird state:

  https://private-fileshare.canonical.com/~bjorn/too-many-machines.png

There should be 19 machines and 2 resource pools, but it instead showed
57 machines and 4 resource pools.

This happened on two browser tabs that both had the machine listing open.
I'm not sure what triggered it. But I see that the MAAS snap was refreshed
and thus caused a restart to happen, so it might be related to that.

In the logs I see that MAAS was updated from revision 6683 to 6723.

After debugging, it seems that the UI does not refresh its state when it loses its connection to the server and then reconnects.

Tags: ui
Björn Tillenius (bjornt) wrote :

I did try to simply restart the MAAS server, and that didn't trigger any problems with the listing. I wonder if there were any UI changes in the snap refresh that caused it?

tags: added: ui
Changed in maas:
milestone: none → 2.8.0rc1
Björn Tillenius (bjornt) wrote :

Note that this could be a backend bug. I do see some 'machine create' websocket notifications, but I'm not sure what's causing them.

Björn Tillenius (bjornt) wrote :

I'm marking this as Incomplete for now, since we don't know whether it's a UI bug or a backend bug. At the moment I think it's a backend bug, but I'm still debugging.

Changed in maas:
status: New → Incomplete
importance: Undecided → High
Björn Tillenius (bjornt) wrote :

It seems like the error is related to MAAS starting up. I tried keeping a tab open against a MAAS that had already been running for a long time, and I couldn't reproduce the error.

But after leaving the tab open on the machine listing and then restarting MAAS, I see the error happening within a minute after the listing is showing again.

I don't see anything obvious in the logs, but this looks odd:

  2020-06-01 10:40:41 stderr: [error] request to http://127.0.0.1:5240/MAAS/metadata/2012-03-01/ failed. sleeping 4.: HTTP Error 503: Service Unavailable

It keeps trying, but never succeeds.

Changed in maas:
assignee: nobody → Björn Tillenius (bjornt)
tags: removed: ui
Björn Tillenius (bjornt) wrote :

Another datapoint. MAAS failed to query the BMC for three machines, pidgey, natasha, and opelt:

2020-06-01T13:56:47.082420+00:00 jenkins-slave-2 maas.power: [error] pidgey: Power state could not be queried: Incorrect username. Check BMC configuration and try again.
2020-06-01T13:56:47.117859+00:00 jenkins-slave-2 maas.power: [error] pidgey: Could not query power state: Incorrect username. Check BMC configuration and try again..
2020-06-01T13:56:47.134902+00:00 jenkins-slave-2 maas.power: [error] natasha: Power state could not be queried: Incorrect password. Check BMC configuration and try again.
2020-06-01T13:56:47.176709+00:00 jenkins-slave-2 maas.power: [error] natasha: Could not query power state: Incorrect password. Check BMC configuration and try again..
2020-06-01T13:56:47.190883+00:00 jenkins-slave-2 maas.power: [error] opelt: Power state could not be queried: Incorrect password. Check BMC configuration and try again.
2020-06-01T13:56:47.271531+00:00 jenkins-slave-2 maas.power: [error] opelt: Could not query power state: Incorrect password. Check BMC configuration and try again..

At the same time, there's one 'machine create' websocket notification for each of those machines.

I see more of those errors in rackd.log, but so far there's only been one 'machine create' per machine.

Alberto Donato (ack)
Changed in maas:
milestone: 2.8.0rc1 → 2.8.0
Björn Tillenius (bjornt) wrote :

I'm quite sure it's related to the power checks, but I still haven't found the root cause.

What I've found so far is that when a power check fails, the rack issues an RPC call to the region to create a POWER_QUERY_FAILED event. If I comment out that RPC call, I don't see any issues anymore.

I haven't been able to reproduce this in a unit test yet, but I can reproduce it locally. I've confirmed that the database triggers seem ok, since if I listen to all the machine notifications, I only see a 'machine_update' notification.

So the problem should be somewhere in the websocket code.

Changed in maas:
status: Incomplete → In Progress
Björn Tillenius (bjornt) wrote :

Ok, after further investigation, I think this is partly a UI issue. At least, a UI change is what made this regression visible.

I suspect that the UI no longer reloads the listing after a disconnect and reconnect. The UI needs to refresh its state every time it gets a successful connection to the websocket, otherwise it might be in a bad state.
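
To illustrate the "refresh on every successful connection" idea, here is a minimal sketch of a client-side store. It is written in Python purely for illustration and is not maas-ui code: `MachineStore`, `on_connected`, `on_notify`, and the `fetch_machines` callback are all hypothetical names, and keying machines by `system_id` is an assumption.

```python
from typing import Callable, Dict, List


class MachineStore:
    """Hypothetical client-side store; not the actual maas-ui implementation."""

    def __init__(self, fetch_machines: Callable[[], List[dict]]) -> None:
        # fetch_machines stands in for a full list request over the websocket.
        self._fetch_machines = fetch_machines
        self.machines: Dict[str, dict] = {}

    def on_connected(self) -> None:
        # Called on every successful connection, including reconnects:
        # throw away the old state and reload the full listing.
        self.machines = {m["system_id"]: m for m in self._fetch_machines()}

    def on_notify(self, action: str, machine: dict) -> None:
        # Keyed by system_id, so a repeated 'create' for a machine that is
        # already known overwrites the entry instead of adding a duplicate.
        if action in ("create", "update"):
            self.machines[machine["system_id"]] = machine
        elif action == "delete":
            self.machines.pop(machine["system_id"], None)
```

With a store like this, a server restart followed by a reconnect rebuilds the listing from scratch, so stale or duplicated entries from before the disconnect cannot accumulate.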

But we also need to rethink how we do the caching in the websocket handler. It replaces the cache every time it gets a list() call, which means that if you get the machine list in batches, we only keep track of the last batch.
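
The batching problem can be sketched as follows. This is not the MAAS handler code: the class names, the `loaded_pks` cache key, and the `on_listen` method are hypothetical, and the rule "send 'update' if the object is in the cache, otherwise 'create'" is an assumption made here to connect the cache behaviour with the unexpected 'machine create' notifications.

```python
class ReplacingHandler:
    """Hypothetical handler that replaces its cache on every list() call."""

    def __init__(self):
        self.cache = {"loaded_pks": set()}

    def list(self, batch_pks):
        # Only the most recent batch is remembered; earlier ones are lost.
        self.cache["loaded_pks"] = set(batch_pks)
        return list(batch_pks)

    def on_listen(self, pk):
        # Assumed rule: objects the client has not loaded are announced
        # as 'create', objects it has loaded as 'update'.
        return "update" if pk in self.cache["loaded_pks"] else "create"


class AccumulatingHandler(ReplacingHandler):
    """Same handler, but list() adds each batch to the existing cache."""

    def list(self, batch_pks):
        self.cache["loaded_pks"].update(batch_pks)
        return list(batch_pks)


replacing = ReplacingHandler()
accumulating = AccumulatingHandler()
for handler in (replacing, accumulating):
    handler.list([1, 2])  # first batch
    handler.list([3, 4])  # second batch

# A change to machine 1 (from the first batch) arrives afterwards.
print(replacing.on_listen(1))     # create
print(accumulating.on_listen(1))  # update
```

Under these assumptions, a client that loaded the listing in batches would receive a 'create' for any changed machine outside the last batch, even though the database only reported an update.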

tags: added: ui
description: updated
Changed in maas-ui:
importance: Undecided → Unknown
status: New → Unknown
Changed in maas:
status: In Progress → Fix Committed
Changed in maas-ui:
status: Unknown → Fix Released
Alberto Donato (ack)
Changed in maas:
status: Fix Committed → Fix Released