Node is stuck "exiting rescue mode" from failed deployment

Bug #1771777 reported by Alberto Donato
This bug affects 17 people
Affects   Status                  Importance  Assigned to  Milestone
MAAS      Status tracked in 3.6
3.5       Won't Fix               Medium      Unassigned
3.6       Triaged                 Medium      Unassigned

Bug Description

After a failed deployment, I put the node in rescue mode to access the curtin config.
This works, but when I try to exit rescue mode, the node is stuck in that state.

Looking at the code, it seems we don't handle the case where the node's previous state was FAILED_DEPLOYMENT.

What I did was:
- deploy a node
- deploy failed, so state is FAILED_DEPLOYMENT
- select "enter rescue mode" from the UI
- after checking the node via ssh, I selected "exit rescue mode"
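A minimal sketch of the suspected gap, not MAAS's actual code: it assumes that exiting rescue mode is meant to restore the node's previous status, and that the node stays in the exiting state when that previous status is not in the handled set. The `Status` enum, the `RESTORABLE` set and `status_after_rescue` are hypothetical names introduced for illustration only.

```python
from enum import Enum, auto


class Status(Enum):
    # Hypothetical subset of node states, named after the statuses in this report.
    READY = auto()
    DEPLOYED = auto()
    BROKEN = auto()
    FAILED_DEPLOYMENT = auto()
    RESCUE_MODE = auto()
    EXITING_RESCUE_MODE = auto()


# Previous statuses that the exit path knows how to restore (an assumption).
# The suspected bug corresponds to FAILED_DEPLOYMENT being absent from this set.
RESTORABLE = {
    Status.READY,
    Status.DEPLOYED,
    Status.BROKEN,
    Status.FAILED_DEPLOYMENT,
}


def status_after_rescue(previous, handled=RESTORABLE):
    """Return the status a node lands in once rescue mode has been exited."""
    if previous in handled:
        return previous
    # Unhandled previous status: nothing moves the node on, so it stays stuck.
    return Status.EXITING_RESCUE_MODE
```

Calling `status_after_rescue(Status.FAILED_DEPLOYMENT)` with a `handled` set that omits `FAILED_DEPLOYMENT` returns `EXITING_RESCUE_MODE`, which mirrors the stuck state described above.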

Alberto Donato (ack)
description: updated
Andres Rodriguez (andreserl) wrote :

I've been able to reproduce the same issue, although I have not seen any error logs. This may be a backend issue that we are not tracking correctly.

Changed in maas:
importance: Undecided → High
status: New → Triaged
milestone: none → 2.5.0alpha2
summary: - Node is stuck "exiting rescue mode" from failed deployment
+ [2.5] Node is stuck "exiting rescue mode" from failed deployment
Changed in maas:
milestone: 2.5.0alpha2 → 2.5.x
Changed in maas:
milestone: 2.5.x → 2.5.0beta1
Changed in maas:
milestone: 2.5.0beta1 → 2.5.0beta2
Changed in maas:
milestone: 2.5.0beta2 → 2.5.0rc1
Changed in maas:
milestone: 2.5.0rc1 → 2.5.0
Changed in maas:
milestone: 2.5.0 → 2.5.x
Pedro Guimarães (pguimaraes) wrote : Re: [2.5] Node is stuck "exiting rescue mode" from failed deployment

Hi, I am also affected by this bug. I was planning to just change the status of the machine on Postgres from EXITING (status 19) to READY (status 4). Comparing with an already "Ready" machine, I could see that the SQL would look like:

UPDATE maasserver_node SET status=4, osystem='', distro_series='', hwe_kernel='', agent_name='' WHERE id = ID

What stopped me from doing so was the unknown field "agent_name". I couldn't find any relevant information about agent_name in maasserver's source code, and I worried that changing it would cause inconsistencies in the database.

Could I solve this problem with that SQL instead of removing the machine and recommissioning it (a much slower process)?
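A hedged sketch of how such a manual reset could be guarded, so that it only touches a node that is still in status 19 and runs inside a transaction. It uses an in-memory SQLite table as a stand-in for the real Postgres database. The table and column names are the ones from the SQL above, the sample row is made up, and `reset_stuck_node` is a hypothetical helper. Whether editing the database by hand is safe for MAAS is not established here.

```python
import sqlite3

EXITING_RESCUE_MODE = 19  # status code quoted in the comment above
READY = 4                 # status code quoted in the comment above


def reset_stuck_node(conn, node_id):
    """Reset a node to READY only if it is still in EXITING_RESCUE_MODE."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "UPDATE maasserver_node "
            "SET status = ?, osystem = '', distro_series = '', "
            "hwe_kernel = '', agent_name = '' "
            "WHERE id = ? AND status = ?",
            (READY, node_id, EXITING_RESCUE_MODE),
        )
    # rowcount is 1 only when the guard matched and the row was updated.
    return cur.rowcount == 1


# Stand-in table holding only the columns named in the SQL above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE maasserver_node (id INTEGER PRIMARY KEY, status INTEGER, "
    "osystem TEXT, distro_series TEXT, hwe_kernel TEXT, agent_name TEXT)"
)
# Made-up sample row representing a node stuck in status 19.
conn.execute(
    "INSERT INTO maasserver_node VALUES (1, 19, 'ubuntu', 'bionic', '', 'agent')"
)
conn.commit()
```

The extra `AND status = ?` condition makes the update a no-op if the node has already left the stuck state. Against Postgres the same statement would need that driver's placeholder style (for example `%s` with psycopg2) instead of `?`.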

Andres Rodriguez (andreserl) wrote :

Hi Pedro,

You can reset the agent_name back to nothing if you are making this machine go back to failed deployment or ready state.

That said, can you reproduce this reliably?

Changed in maas:
status: Triaged → Incomplete
milestone: 2.5.x → 2.6.0
Pedro Guimarães (pguimaraes) wrote :

Hi Andres, no, I haven't tried that yet. I do use the "Test Hardware" option, and that resets the machine's state to "Ready" from "Rescue Mode", not from "Exiting Rescue Mode".

Peter Sabaini (peter-sabaini) wrote :

I might have hit this issue. After a failed deployment I entered rescue mode, then attempted to exit; it has been hanging at "Exiting Rescue Mode" for some time now. This is on MAAS 2.3.5-6511-gf466fdb-0ubuntu1 / xenial, however.

I've put up some logs here: https://private-fileshare.canonical.com/~sabaini/lp1771777/maas-logs.tar.gz

Peter Sabaini (peter-sabaini) wrote :

PS: the failing node is system_id=ka67eg

Trent Lloyd (lathiat) wrote :

I had this happen on MAAS 2.3.5 where I also entered rescue mode from failed deployment.

After selecting exit rescue mode, the machine did not power down automatically. However, I SSHed in and ran poweroff, and once the machine had stopped I ran a power check in MAAS (this part may not have been necessary). This caused it to go back to the failed deployment state.

Shawn Weeks (absolutesantaja) wrote :

Ran into the same issue: failed deployment loop, switched to rescue mode, then had to force a shutdown on the host and check power to get it out of rescue mode. This is on MAAS 2.6.0.

Shawn Weeks (absolutesantaja) wrote :

Can confirm this is still happening with 2.6.2 (7841-ga10625be3-0ubuntu1~18.04.1)

Changed in maas:
milestone: 2.6.0 → none
status: Incomplete → Triaged
Jeremy Mordkoff (jeremy-mordkoff) wrote :

Still happens with MAAS 2.8 installed from snap:

root@maas3:~# snap list
Name          Version                 Rev   Tracking       Publisher   Notes
core18        20200427                1754  latest/stable  canonical*  base
maas          2.8.0-8557-g.1f4b79007  7328  2.8/stable     canonical*  -
maas-cli      0.6.5                   13    latest/stable  canonical*  -
maas-test-db  10.6-14-g.70a88b9       34    2.8/stable     canonical*  -
snapd         2.45.1                  8140  latest/stable  canonical*  snapd

However the workaround is simple -- just power off the machine yourself.

Igor Gnip (igorgnip) wrote :

Same issue here with MAAS 2.8.2.

Igor Gnip (igorgnip) wrote :

The machine was still reachable over SSH, so I managed to issue the "poweroff" command. That caused an error to be detected with power control of the server, which allowed me to mark it broken, then mark it fixed, and it was finally in the Ready state.

Tom Mercelis (tom-mercelis) wrote :

Using
maas 3.0.0-10029-g.986ea3e45 15003 3.0/stable canonical✓
I hit this via the same scenario: deployed a host, the deployment failed (something about being unable to find the /boot/efi/ partition for the grub installation), and I went into rescue mode to wipe the partition table. With the host still on, I tried to exit rescue mode: stuck in "Exiting rescue mode". I SSHed back into the host and ran "sudo shutdown now" from there, then clicked "Check power status" in MAAS. It detected that the host is now powered off, but the host is still stuck in "Exiting rescue mode".

Time Event
Fri, 27 Aug. 2021 12:16:49 Node - Exited rescue mode on 'blade-f5'.
Fri, 27 Aug. 2021 12:16:49 Node changed status - From 'Failed to exit rescue mode' to 'Exiting rescue mode'
Fri, 27 Aug. 2021 12:16:49 User stopping rescue mode - (tmerceli)
Fri, 27 Aug. 2021 12:16:32 Node changed status - From 'Exiting rescue mode' to 'Failed to exit rescue mode'
Fri, 27 Aug. 2021 12:16:32 Failed exiting rescue mode
Fri, 27 Aug. 2021 12:10:20 Node - Exited rescue mode on 'blade-f5'.
Fri, 27 Aug. 2021 12:10:20 Node changed status - From 'Rescue mode' to 'Exiting rescue mode'
Fri, 27 Aug. 2021 12:10:20 User stopping rescue mode - (tmerceli)
Fri, 27 Aug. 2021 11:58:15 Node status event - 'cloudinit' running config-phone-home with frequency once-per-instance

summary: - [2.5] Node is stuck "exiting rescue mode" from failed deployment
+ Node is stuck "exiting rescue mode" from failed deployment
Changed in maas:
importance: High → Medium
milestone: none → 3.4.0
Tobias McNulty (tobias-mcnulty) wrote :

I hit this on MAAS 3.2 today too. Fix was:

- Power on the machine manually. Hopefully it goes to "Exiting rescue mode failed" status (this step is key)
- Mark the machine broken
- Mark the machine fixed
- Now it can be deployed again
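The mark broken / mark fixed part of these steps could also be driven from the MAAS CLI. The sketch below only builds the command lines and does not run them. It assumes the `maas <profile> machine` subcommands `query-power-state`, `mark-broken` and `mark-fixed` are available in the installed MAAS version; the profile name `admin`, the system ID `abc123` and the helper `recovery_commands` are placeholders introduced for illustration. Powering the machine on manually, as in the first step, is left out.

```python
import shlex


def recovery_commands(profile, system_id):
    """Build (but do not run) the CLI calls for the workaround steps."""
    base = ["maas", profile, "machine"]
    return [
        # Ask MAAS to re-check the power state after the manual power change.
        base + ["query-power-state", system_id],
        # Mark the machine broken, then fixed, so it can be deployed again.
        base + ["mark-broken", system_id],
        base + ["mark-fixed", system_id],
    ]


# Placeholder profile and system ID; substitute real values before use.
for argv in recovery_commands("admin", "abc123"):
    print(shlex.join(argv))
```

Each printed line can be reviewed and run by hand, or passed to `subprocess.run(argv, check=True)` once the status shown in the UI confirms the node has left "Exiting rescue mode".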

Alberto Donato (ack)
Changed in maas:
milestone: 3.4.0 → 3.4.x
Navdeep (navdeep-bjn) wrote :

This is an issue in 3.3 as well. I put a failed node into rescue mode to investigate the failure. After I was done, I selected "exit rescue mode". The machine powered off, but the node in the MAAS UI still showed "Exiting rescue mode" and never left that state. There was no option to "Mark the machine broken" in the Machine menu. I tried "Mark the machine broken" from the Machines dashboard and it errored out.

Changed in maas:
milestone: 3.4.x → 3.5.x
Giovanni Tirloni (gtirloni) wrote :

I'm hitting the same issue in 3.4.0. I can SSH in and run `poweroff`, which lets the node go back to failed deployment status so I can perform other actions on it. Before that, it's stuck forever.
