controller unable to unlock after reinstall (inv_state stuck at reinstalling)

Bug #1890970 reported by Anujeyan Manokeran
This bug affects 1 person
| Affects | Status | Importance | Assigned to | Milestone |
| StarlingX | Fix Released | Medium | Eric MacDonald | |

Bug Description

Brief Description
----------------------
Controller-1 was reinstalled successfully, but inv_state remained stuck at "reinstalling", and the subsequent unlock failed because inv_state was stuck in that state.
Automation test logs http://128.224.150.21/auto_logs/wp_8_12/202008081139

| install_output | text |
| install_state | completed |
| install_state_info | None |
| inv_state | reinstalling |

The unlock failed with the following error:
  raise exceptions.CLIRejected(out)
 E utils.exceptions.CLIRejected: CLI command is rejected.
 E Details: Can not unlock host controller-1 undergoing reinstall. Please ensure host has completed reinstall prior to unlock.

Steps to Reproduce
------------------
1. Install the system with the latest load.
2. Lock the controller host.
3. Reinstall the host (host-reinstall).
4. Unlock the controller.

System Configuration
--------------------
Regular system AIO+DX + worker

Expected Behavior
------------------
Unlock succeeds after reinstall.

Actual Behavior
----------------
Unlock failed, as described above.

Reproducibility
---------------
Seen only once in this lab.

Load
----

Build date 2020-08-07_20-00-00

Last Pass
---------

Timestamp/Logs
--------------
Reinstall command :
2020-08-08T16:27:37.000 controller-0 -sh: info HISTORY: PID=506243 UID=42425 system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[face::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-reinstall controller-1

Unlock command :
2020-08-08T16:31:03.000 controller-0 -sh: info HISTORY: PID=506243 UID=42425 system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[face::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-unlock controller-1

Test Activity
-------------
Regression test

Revision history for this message
Anujeyan Manokeran (anujeyan) wrote :
tags: added: stx.retestneeded
Revision history for this message
Frank Miller (sensfan22) wrote :

Please provide a timestamp for the reinstall step as well as the unlock step.

Also please update the reproducible comment as it doesn't seem to apply (ie no subcloud in this test)

summary: - inv_state was stucked state after controller reinstall unable to unlock
+ controller unable to unlock after reinstall (inv_state stuck at
+ reinstalling)
description: updated
description: updated
description: updated
Revision history for this message
Ghada Khalil (gkhalil) wrote :

stx.5.0 / medium priority - As per Yang Liu, this is a new automated test-case. The issue is intermittent, but should be investigated/addressed.

tags: added: stx.5.0 stx.config
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: nobody → Yuxing (yuxing)
Revision history for this message
Yuxing (yuxing) wrote :

It is expected by design that inv_state blocks the unlock: the unlock should wait until inv_state changes to "inventoried". The test step should be changed to account for this.

However, inv_state staying stuck for more than 30 minutes is not expected. A possible reason:

After the BMC completed the reinstall steps, an RPC problem occurred that left inv_state in "reinstalling" indefinitely. A workaround is to reboot the controller. A timeout may need to be added for the inventory job.
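
A minimal sketch of the wait-before-unlock step suggested above, with a timeout so a stuck state is reported rather than waited on forever. This is not the test framework's actual code: `wait_for_inv_state`, `InventoryTimeout`, and the `get_inv_state` callable (which would wrap whatever query the test uses to read inv_state) are hypothetical names, and the 30-minute default is only taken from the duration mentioned in this comment.

```python
import time
from typing import Callable


class InventoryTimeout(Exception):
    """Raised when the host never reaches the target inventory state."""


def wait_for_inv_state(
    get_inv_state: Callable[[], str],          # hypothetical: returns the current inv_state
    target: str = "inventoried",
    timeout_s: float = 1800.0,                 # assumption: 30 minutes
    poll_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> float:
    """Poll until inv_state equals target; return the elapsed seconds."""
    start = clock()
    while True:
        state = get_inv_state()
        elapsed = clock() - start
        if state == target:
            return elapsed
        if elapsed >= timeout_s:
            raise InventoryTimeout(
                f"inv_state still {state!r} after {elapsed:.0f}s (wanted {target!r})"
            )
        sleep(poll_s)
```

A test would call this after the reinstall completes and issue the unlock only once it returns; an `InventoryTimeout` would then point at the stuck inventory state rather than at the unlock itself.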

tags: added: stx.update
Revision history for this message
Yuxing (yuxing) wrote :

Logs in sysinv.log on controller-0:
sysinv 2020-08-08 16:27:41.261 ..."hostname": "controller-1", "iscsi_initiator_name": "iqn.1994-05.com.redhat:4239c5454380", "capabilities": {"stor_function": "monitor"}, "install_output": "text", "device_image_update": null, "location": {}, "availability": "online", "invprovision": "provisioned", "peer_id": null, "administrative": "locked", "personality": "controller", "recordtype": "standard", "reboot_needed": false, "bm_mac": null, "inv_state": "reinstalling"...

...

sysinv 2020-08-08 16:31:04.276 106281 WARNING wsme.api [-] Client-side error: Can not unlock host controller-1 undergoing reinstall. Please ensure host has completed reinstall prior to unlock.: ClientSideError: Can not unlock host controller-1 undergoing reinstall. Please ensure host has completed reinstall prior to unlock.

No entry between these two log lines indicates a change of "inv_state" from "reinstalling".
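
The check above can be sketched as a small log scan. This assumes sysinv.log lines begin with `sysinv <date> <time>.<ms>` and carry inv_state as a JSON-style `"inv_state": "<value>"` pair, as in the excerpts shown; `inv_state_changes` is a hypothetical helper, not an existing tool.

```python
import re
from datetime import datetime

# Assumed line prefix, based on the excerpts above: "sysinv 2020-08-08 16:27:41.261 ..."
LINE_RE = re.compile(r"^sysinv (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})")
STATE_RE = re.compile(r'"inv_state":\s*"([^"]+)"')


def inv_state_changes(lines, start, end):
    """Return (timestamp, inv_state) for each change of value seen within [start, end]."""
    changes = []
    last = None
    for line in lines:
        ts_match = LINE_RE.match(line)
        state_match = STATE_RE.search(line)
        if not ts_match or not state_match:
            continue  # line has no timestamp or does not mention inv_state
        ts = datetime.strptime(ts_match.group(1), "%Y-%m-%d %H:%M:%S.%f")
        if not (start <= ts <= end):
            continue
        state = state_match.group(1)
        if state != last:
            changes.append((ts, state))
            last = state
    return changes
```

If the only entry returned for the window between the reinstall and the unlock is "reinstalling", the state never moved on during that interval.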

Revision history for this message
Yuxing (yuxing) wrote :

The kern.log on controller-1 is abnormal; there are no log entries following the host reinstall:
2020-08-08T16:26:14.397 controller-1 kernel: info [ 854.250395] EXT4-fs (dm-9): resizing filesystem from 6553600 to 13107200 blocks
2020-08-08T16:26:14.460 controller-1 kernel: info [ 854.313417] EXT4-fs (dm-9): resized filesystem to 13107200
2020-08-08T17:13:16.275 controller-1 kernel: info [ 3669.183553] Rounding down aligned max_sectors from 4294967295 to 4294967288

Revision history for this message
Ghada Khalil (gkhalil) wrote :

This error looks similar to the one reported in: https://bugs.launchpad.net/starlingx/+bug/1912623

Revision history for this message
John Kung (john-kung) wrote :

This issue has been resolved by https://bugs.launchpad.net/starlingx/+bug/1865087

As per note from Eric:
The installation issue that occurred here is a result of a dysfunctional BMC behavior that accepts and passes a power off command but does not actually power off the host.

This issue is identical to and represented by the following launchpad and only ever seen with wolfpass servers.

The collect logs show that this system was in fact a wolfpass based system yow-cgcs-wolfpass-16-17

https://bugs.launchpad.net/starlingx/+bug/1865087 - Power off host operation reports completed even if host remains powered on

The fix for this issue is implemented by the following merged (Nov 23 10:36 AM) update.

https://review.opendev.org/c/starlingx/metal/+/763421 - Make Mtce Power-Off FSM verify power-off
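
For illustration only, the idea of verifying a power-off instead of trusting the BMC's acceptance of the command could look like the sketch below. It is not the merged Mtce change; `power_off_verified`, `send_power_off`, and `is_powered_on` are hypothetical names, and the retry counts and interval are arbitrary assumptions.

```python
import time
from typing import Callable


def power_off_verified(
    send_power_off: Callable[[], None],   # hypothetical: issues the BMC power-off command
    is_powered_on: Callable[[], bool],    # hypothetical: queries the actual power state
    attempts: int = 3,                    # assumption: how often to re-issue the command
    checks: int = 5,                      # assumption: power-state queries per attempt
    check_interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True only once the host is observed to be powered off."""
    for _ in range(attempts):
        send_power_off()
        for _ in range(checks):
            if not is_powered_on():
                return True
            sleep(check_interval_s)
    # The BMC accepted the command each time, but the host never powered off.
    return False
```

The point of the design is that a "command accepted" reply is never treated as success; only a queried power state of off is.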

If this issue occurred at a customer site, there is an argument that the 'power control' issue is due to a bug in the server's BMC.

Changed in starlingx:
status: Triaged → Fix Released
assignee: Yuxing (yuxing) → nobody
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
Ghada Khalil (gkhalil)
tags: removed: stx.retestneeded