"dcmanager fw-update-strategy apply" subcloud failed, but the subcloud still gets lock/unlocked

Bug #1890502 reported by Difu Hu
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Eric MacDonald

Bug Description

Brief Description
-----------------
"dcmanager fw-update-strategy apply" failed on the subcloud, but the subcloud host was still locked and unlocked.
As a result, the user loses the best chance to debug the failure.

Severity
--------
Major

Steps to Reproduce
------------------
Precondition: the subcloud3 FPGA has already been flashed with a root-key image.
Run "dcmanager fw-update-strategy apply" to update the subcloud3 FPGA with an unsigned image.

Expected Behavior
------------------
The subcloud3 FPGA update fails, and the host is left as it is, with no lock/unlock.
The user can then debug the failure on the host.

Actual Behavior
----------------
The subcloud3 FPGA update failed, but the subcloud host was still locked and unlocked.
/var/log/kern.log appears to have been cleared. It is not clear whether other information was cleared as well.

Reproducibility
---------------
permanent

System Configuration
--------------------
Lab-name: DC-3

Branch/Pull Time/Commit
-----------------------
2020-07-31_20-00-00

Last Pass
---------
N/A

Timestamp/Logs
--------------
on SystemController:
[sysadmin@controller-0 ~(keystone_admin)]$ dcmanager strategy-step list
+-----------+-------+---------+----------------------------------------------------------+----------------------------+----------------------------+
| cloud     | stage | state   | details                                                  | started_at                 | finished_at                |
+-----------+-------+---------+----------------------------------------------------------+----------------------------+----------------------------+
| subcloud3 | 2     | failed  | finishing fw update: Not all images applied successfully | 2020-08-05 21:20:08.708953 | 2020-08-05 21:47:41.528227 |
+-----------+-------+---------+----------------------------------------------------------+----------------------------+----------------------------+

on subcloud3:

[sysadmin@controller-0 ~(keystone_admin)]$ system device-image-state-list
+--------------+----------+--------------------------------------+--------+----------------+----------------------------------+
| hostname     | PCI      | Device image uuid                    | status | Update start   | updated_at                       |
|              | device   |                                      |        | time           |                                  |
|              | address  |                                      |        |                |                                  |
+--------------+----------+--------------------------------------+--------+----------------+----------------------------------+
| controller-0 | 0000:b4: | 06265036-f51c-4279-bd56-d6798c99bdda | failed | 2020-08-05T21: | 2020-08-05T21:35:59.161456+00:00 |
|              | 00.0     |                                      |        | 20:39.         |                                  |
|              |          |                                      |        | 537782+00:00   |                                  |
|              |          |                                      |        |                |                                  |
+--------------+----------+--------------------------------------+--------+----------------+----------------------------------+

[sysadmin@controller-0 ~(keystone_admin)]$ date
Wed Aug 5 21:55:22 UTC 2020

[sysadmin@controller-0 ~(keystone_admin)]$ system host-show controller-0
+-----------------------+----------------------------------------------------------------------+
| Property | Value |
+-----------------------+----------------------------------------------------------------------+
| action | none |
| administrative | unlocked |
| availability | available |
| bm_ip | None |
| bm_type | none |
| bm_username | None |
| boot_device | /dev/disk/by-path/pci-0000:03:00.0-nvme-1 |
| capabilities | {u'stor_function': u'monitor', u'Personality': u'Controller-Active'} |
| clock_synchronization | ntp |
| config_applied | 75654015-b5e5-4759-acdd-302ebecae9c4 |
| config_status | None |
| config_target | 75654015-b5e5-4759-acdd-302ebecae9c4 |
| console | ttyS0,115200n8 |
| created_at | 2020-08-02T03:28:12.711181+00:00 |
| device_image_update | |
| hostname | controller-0 |
| id | 1 |
| install_output | text |
| install_state | None |
| install_state_info | None |
| inv_state | inventoried |
| invprovision | provisioned |
| location | {} |
| mgmt_ip | fd01:303::3 |
| mgmt_mac | 48:df:37:d6:47:6c |
| operational | enabled |
| personality | controller |
| reboot_needed | False |
| reserved | False |
| rootfs_device | /dev/disk/by-path/pci-0000:03:00.0-nvme-1 |
| serialid | None |
| software_load | 20.06 |
| subfunction_avail | available |
| subfunction_oper | enabled |
| subfunctions | controller,worker |
| task | |
| tboot | false |
| ttys_dcd | None |
| updated_at | 2020-08-05T21:54:56.841046+00:00 |
| uptime | 870 |
| uuid | 56f9e63a-7dca-48aa-977b-8a5cca8137d5 |
| vim_progress_status | services-enabled |
+-----------------------+----------------------------------------------------------------------+

[sysadmin@controller-0 ~(keystone_admin)]$ cat /var/log/kern.log | grep intel-max10
[sysadmin@controller-0 ~(keystone_admin)]$

Test Activity
-------------
Functional Testing

Difu Hu (difuhu)
summary: - subcloud FPGA update failed, but the subcloud still gets lock/unlocked
+ "dcmanager fw-update-strategy apply" subcloud failed, but the subcloud
+ still gets lock/unlocked
Al Bailey (albailey1974) wrote :

This indicates that the VIM apply-strategy returned successfully. As part of this, the VIM does the lock/unlock.

The dc strategy failed after the VIM state is completed because it examined the contents of the device-image-state-list on the subcloud and found a failed image.

This is properly reporting the failure, as far as I can tell.
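
The sequence described in this comment can be sketched roughly as follows. This is only an illustration, not the dcmanager code: the function name `finish_fw_update`, its arguments, the dictionary keys, and the message for a failed VIM strategy are all assumptions. Only the `failed` status value and the "Not all images applied successfully" detail string are taken from the logs above.

```python
from typing import Dict, List, Optional

# Detail string as it appears in the "dcmanager strategy-step list" output above.
IMAGES_FAILED_DETAIL = "finishing fw update: Not all images applied successfully"


def finish_fw_update(vim_strategy_succeeded: bool,
                     image_states: List[Dict[str, str]]) -> Optional[str]:
    """Return a failure detail string, or None if the step passes.

    Hypothetical helper: by the time this check runs, the VIM strategy has
    already completed, so any lock/unlock it performs has already happened.
    """
    if not vim_strategy_succeeded:
        # Hypothetical message; the real wording is not shown in this report.
        return "VIM strategy apply failed"

    # Rows mirror "system device-image-state-list" on the subcloud.
    failed = [row for row in image_states if row["status"] == "failed"]
    if failed:
        return IMAGES_FAILED_DETAIL
    return None


# The single row reported on subcloud3 in this bug.
subcloud3_states = [{
    "hostname": "controller-0",
    "pci_address": "0000:b4:00.0",
    "image_uuid": "06265036-f51c-4279-bd56-d6798c99bdda",
    "status": "failed",
}]

print(finish_fw_update(True, subcloud3_states))
```

In this sketch the failure is only noticed after the VIM strategy has returned, which matches the observation that the host had already been locked and unlocked when the step was marked failed.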

Chris Friesen (cbf123) wrote :

It might make sense to only lock/unlock the host if there has been a successful image write. If the only image write on a host fails, I'm not sure the lock/unlock makes sense since it will cause an outage for no reason.
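
A minimal sketch of the rule suggested here, under the assumption that the per-host write results are available as booleans. The function name `should_lock_unlock` is hypothetical, and this is not necessarily how the eventual fix was implemented.

```python
from typing import Iterable


def should_lock_unlock(write_results: Iterable[bool]) -> bool:
    """Decide whether a host needs a lock/unlock after a firmware update.

    Hypothetical helper: each item is True if that image write succeeded.
    The host is only cycled when at least one write succeeded, so a host
    whose only write failed is left untouched for debugging.
    """
    return any(write_results)


print(should_lock_unlock([False]))        # only write failed
print(should_lock_unlock([True, False]))  # one of two writes succeeded
```

With this rule, the scenario in this report (a single failed write on controller-0) would leave the host in place and avoid the outage.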

Frank Miller (sensfan22) wrote :

Marking as stx.5.0 gating, as this feature is an stx.5.0 deliverable. The nfv-vim should detect whether the fw update actually worked, not just check whether the strategy was applied.

Changed in starlingx:
status: New → Triaged
importance: Undecided → Medium
assignee: nobody → Al Bailey (albailey1974)
tags: added: stx.5.0
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: Al Bailey (albailey1974) → Eric MacDonald (rocksolidmtce)
Ghada Khalil (gkhalil) wrote :

We believe this is fixed by https://review.opendev.org/745759 which merged on 2020-08-12.
@Difu, please re-test and let us know if there are still any issues.

Changed in starlingx:
status: Triaged → Fix Released
Ghada Khalil (gkhalil)
tags: added: stx.nfv
Difu Hu (difuhu) wrote :

Verified on build 2020-06-27_18-35-20 with PATCH_0002

tags: removed: stx.retestneeded