Controller Commissioning status is Failed

Bug #2066182 reported by Marian Gasparovic
This bug affects 1 person
Affects  Status   Importance  Assigned to  Milestone
MAAS     Invalid  Undecided   Unassigned
3.2      Invalid  Undecided   Unassigned

Bug Description

MAAS 3.2.11 testing

During MAAS setup we wait for controllers to have commissioning status Passed. However, we encountered two test runs in which at least one controller reported Failed status.

2024-05-16-13:53:07 root DEBUG [localhost]: maas root rack-controllers read
2024-05-16-13:53:12 foundationcloudengine.layers.configuremaas INFO Controller infra1: Commissioning status is Passed
2024-05-16-13:53:12 foundationcloudengine.layers.configuremaas INFO Controller infra2: Commissioning status is Passed
2024-05-16-13:53:12 foundationcloudengine.layers.configuremaas INFO Controller infra3: Commissioning status is Failed
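
For readers unfamiliar with this step, the wait described above can be sketched roughly as follows. This is a minimal illustration, not the actual FCE implementation: `wait_for_controllers`, `fetch_statuses`, and the timeout and interval defaults are all hypothetical. `fetch_statuses` is assumed to be any callable that returns a mapping of controller hostname to commissioning status name.

```python
import time


def wait_for_controllers(fetch_statuses, timeout=600.0, interval=10.0,
                         sleep=time.sleep, clock=time.monotonic):
    """Poll until every controller reports Passed.

    fetch_statuses is a hypothetical callable returning
    {hostname: commissioning_status_name}. Raises RuntimeError as soon as
    any controller reports Failed, and TimeoutError if the deadline passes.
    """
    deadline = clock() + timeout
    while True:
        statuses = fetch_statuses()
        failed = sorted(name for name, status in statuses.items()
                        if status == "Failed")
        if failed:
            raise RuntimeError("Commissioning failed on: " + ", ".join(failed))
        if statuses and all(status == "Passed" for status in statuses.values()):
            return statuses
        if clock() >= deadline:
            raise TimeoutError("Controllers did not reach Passed in time")
        sleep(interval)
```

Failing fast on Failed (rather than waiting out the timeout) matches the behaviour seen in the log above, where the run stops once infra3 reports Failed.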

Link to logs and artifacts
- https://oil-jenkins.canonical.com/artifacts/78e84027-f246-4dbb-a02f-12e9408fc41b/index.html

- https://oil-jenkins.canonical.com/artifacts/60b9ba60-7adb-4f55-b33b-5b4a585e65d1/index.html

Tags: cdo-qa sts
Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

@ https://oil-jenkins.canonical.com/artifacts/78e84027-f246-4dbb-a02f-12e9408fc41b/fce_build_layer_30353/fce_build_layer_30353_maas_console.out

2024-05-16-13:53:07 fce.maas INFO Starting step: maas:wait_for_controllers
2024-05-16-13:53:07 root DEBUG [localhost]: maas root rack-controllers read
2024-05-16-13:53:12 foundationcloudengine.layers.configuremaas INFO Controller infra1: Commissioning status is Passed
2024-05-16-13:53:12 foundationcloudengine.layers.configuremaas INFO Controller infra2: Commissioning status is Passed
2024-05-16-13:53:12 foundationcloudengine.layers.configuremaas INFO Controller infra3: Commissioning status is Failed

@ https://oil-jenkins.canonical.com/artifacts/60b9ba60-7adb-4f55-b33b-5b4a585e65d1/fce_build_layer_30316/fce_build_layer_30316_maas_console.out

2024-05-16-10:01:28 fce.maas INFO Starting step: maas:wait_for_controllers
2024-05-16-10:01:28 root DEBUG [localhost]: maas root rack-controllers read
2024-05-16-10:01:32 foundationcloudengine.layers.configuremaas INFO Controller noma: Commissioning status is Passed
2024-05-16-10:01:32 foundationcloudengine.layers.configuremaas INFO Controller sunset: Commissioning status is Failed

Seyeong Kim (seyeongkim) wrote :

I'm looking at the logs, but there is nothing notable except this:

https://pastebin.canonical.com/p/m92hBbPqdm/

How many times have you tried to run the test?

Let me check the error.

tags: added: sts
Seyeong Kim (seyeongkim) wrote :

Today I built an FCE environment locally (with MAAS 3.2/edge); it is slow, but the commissioning status was Passed.

2024-05-22-08:15:58 foundationcloudengine.layers.configuremaas INFO Controller infra2: Commissioning status is Passed
2024-05-22-08:15:58 foundationcloudengine.layers.configuremaas INFO Controller infra3: Commissioning status is Passed
2024-05-22-08:15:58 foundationcloudengine.layers.configuremaas INFO Controller infra1: Commissioning status is Passed

maas 3.2.11-12069-g.5996fed96 35544 3.2/edge canonical** -

I'll keep checking mine and link..

Seyeong Kim (seyeongkim) wrote :

I had a discussion with Marian, and after several more tries this didn't happen, but these others did:

https://bugs.launchpad.net/maas/+bug/1781241
https://bugs.launchpad.net/maas/+bug/2007297

For 2007297, Mauricio is working on it, and I'll also check the error log to see if there is something I can do.

Thanks.

Christian Grabowski (cgrabowski) wrote :

Can you confirm whether this also affects newer MAAS versions?

Changed in maas:
status: New → Incomplete
Jeffrey Chang (modern911) wrote :

SolQA has only bumped into this with MAAS 3.1/3.2, and always with snap installations.
All occurrences can be found at https://solutions.qa.canonical.com/bugs/2066182

We deploy snap MAAS on jammy for most of them, which means we are deploying with PostgreSQL 14 on jammy instead of PostgreSQL 12 on focal.

I did some more tests with snap MAAS on focal, with PostgreSQL 12, and this didn't happen in 20 runs.
Also, 3.2.10 has this problem as well; it failed on my 4th try.

Changed in maas:
status: Incomplete → Invalid
Mauricio Faria de Oliveira (mfo) wrote :

As Jeffrey mentioned, this is apparently an incompatibility of MAAS 3.2 and 3.1 with PostgreSQL 14 (the supported version is PostgreSQL 12 for both, on Focal, but Jammy/pgsql 14 was used on snap-based tests).

At a lower level, the issue is that one of the 3 rack controllers has a commissioning script failure because a database relation is reported as nonexistent, even though it is present in the database.

An attempt to work around it was to wait for 5 minutes after `maas init` on the 1st rack controller, to ensure the database was completely set up and in place before running `maas init` on the 2nd/3rd rack controllers.

That didn't help (the exact same issue happened with the 5-minute delay), so it didn't seem like a race condition/timing issue.

I noticed the PostgreSQL version (and Ubuntu series) not being the supported ones for MAAS 3.2/3.1 (focal/pgsql12), and confirmed that the issue didn't happen on the tests with debs (on focal/pgsql12), but only in snaps (in jammy/pgsql14).
Even though the distro series might (or is supposed to) be less of an issue for snaps, the database version is external to that, so it could lead to issues.
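
A pre-flight check along these lines could flag the mismatch before a test run starts. This is only a sketch: the `SUPPORTED_PG_MAJOR` table encodes just what is stated in this report (PostgreSQL 12 for MAAS 3.1 and 3.2), the function names are hypothetical, and the PostgreSQL version string is assumed to look like the output of `SHOW server_version`.

```python
# Supported PostgreSQL major version per MAAS series, as stated in this
# report only (not an authoritative support matrix).
SUPPORTED_PG_MAJOR = {"3.1": 12, "3.2": 12}


def pg_major(server_version):
    """Extract the major version from a string such as '14.12 (Ubuntu ...)'."""
    return int(server_version.split()[0].split(".")[0])


def maas_series(maas_version):
    """Reduce a version such as '3.2.11-12069-g.5996fed96' to '3.2'."""
    return ".".join(maas_version.split(".")[:2])


def postgres_is_supported(maas_version, server_version):
    """Return True/False, or None when the MAAS series is not in the table."""
    expected = SUPPORTED_PG_MAJOR.get(maas_series(maas_version))
    if expected is None:
        return None
    return pg_major(server_version) == expected
```

With the MAAS version quoted earlier in this thread, a PostgreSQL 14 server would be reported as unsupported and a PostgreSQL 12 server as supported.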

I asked Jeffrey to verify that hypothesis (and test snaps on focal/pgsql12), which turned out to help.
He further confirmed that the 'issue' also affected 3.2.10, i.e., it was not new/exclusive to the 3.2.11 build being tested.

Marking the bug tasks for MAAS 3.2 and devel as Invalid; this was related to an unsupported PostgreSQL version, as far as we can tell.

Thanks!

Some analysis snippets:

$ maas root rack-controllers read | jq -r '.[] | [.hostname, ."commissioning_status_name"]'
[
  "noma",
  "Passed"
]
[
  "sunset",
  "Failed"
]
[
  "anahuac",
  "Passed"
]
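
The same check can be done without jq. The sketch below assumes the output of `maas root rack-controllers read` has been captured as a JSON string, and uses only the two fields shown in the jq filter above (`hostname` and `commissioning_status_name`); the function names and the sample payload are illustrative.

```python
import json


def commissioning_statuses(rack_controllers_json):
    """Map hostname -> commissioning status name from the CLI's JSON output."""
    controllers = json.loads(rack_controllers_json)
    return {c["hostname"]: c["commissioning_status_name"] for c in controllers}


def not_passed(statuses):
    """Return the sorted hostnames whose status is anything but Passed."""
    return sorted(name for name, status in statuses.items()
                  if status != "Passed")


# Illustrative payload reduced to the two fields used here.
SAMPLE = json.dumps([
    {"hostname": "noma", "commissioning_status_name": "Passed"},
    {"hostname": "sunset", "commissioning_status_name": "Failed"},
    {"hostname": "anahuac", "commissioning_status_name": "Passed"},
])

print(not_passed(commissioning_statuses(SAMPLE)))
```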

$ maas root events query hostname=sunset
...
        {
...
            "level": "INFO",
            "created": "Tue, 30 Jul. 2024 16:40:59",
            "type": "Failed commissioning",
            "description": ""
        },
        {
...
            "level": "INFO",
            "created": "Tue, 30 Jul. 2024 16:40:44",
            "type": "Script",
            "description": "50-maas-01-commissioning failed"
        },
        {
...
            "level": "ERROR",
            "created": "Tue, 30 Jul. 2024 16:40:44",
            "type": "Script result lookup or storage error",
            "description": "sunset.maas(np3dam): commissioning script '50-maas-01-commissioning' failed during post-processing."
        },
        {
...
            "level": "ERROR",
            "created": "Tue, 30 Jul. 2024 16:40:43",
            "type": "Script result lookup or storage error",
            "description": "Failed processing commissioning data: relation \"maasserver_podhost\" does not exist\nLINE 2: FROM maasserver_podhost\n  ^\nQUERY: SELECT pod_id\n FROM maasserver_podhost\n JOIN maasserver_nodeconfig\n ON maasserver_nodeconfig.node_id = maasserver_podhost.node_id\n WHERE maasserver_nodeconfig.id = NEW.node_config_id\nCONTEXT: PL/pgSQL function interface_pod_notify() line 7 at SQL statement\n"
        }

High-level summary from it:

 Failed commissioning

 50-maas-01-commissioning failed

 sunset.maas(np3dam): commissioning...
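
The event triage above can be sketched as a small filter. It assumes the caller has already extracted the list of event objects from the `maas root events query` output (the surrounding structure is elided in the snippet above, so it is not modelled here) and that the list is ordered newest first, as shown. The function names are hypothetical and the sample descriptions are abridged.

```python
def error_events(events):
    """Keep ERROR-level events, oldest first (input assumed newest first)."""
    return [e for e in reversed(events) if e.get("level") == "ERROR"]


def first_line(description):
    """Return only the first line of a multi-line event description."""
    return description.split("\n", 1)[0]


# Abridged copies of the events quoted above, newest first.
SAMPLE_EVENTS = [
    {"level": "INFO", "type": "Failed commissioning", "description": ""},
    {"level": "INFO", "type": "Script",
     "description": "50-maas-01-commissioning failed"},
    {"level": "ERROR", "type": "Script result lookup or storage error",
     "description": "sunset.maas(np3dam): commissioning script "
                    "'50-maas-01-commissioning' failed during post-processing."},
    {"level": "ERROR", "type": "Script result lookup or storage error",
     "description": "Failed processing commissioning data: relation "
                    "\"maasserver_podhost\" does not exist\n"
                    "LINE 2: FROM maasserver_podhost\n"},
]

for event in error_events(SAMPLE_EVENTS):
    print(first_line(event["description"]))
```

Reading the errors oldest first puts the root cause (the missing-relation error) ahead of the follow-on post-processing failure.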
