"dcmanager fw-update-strategy apply" updates a subcloud failed, the subcloud can never be updated again

Bug #1890915 reported by Difu Hu
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Eric MacDonald

Bug Description

Brief Description
-----------------
After "dcmanager fw-update-strategy apply" fails to update a subcloud, that subcloud can never be updated again.
Tried on both subcloud2 and subcloud4, with the same result.
Removing the label and image on the SystemController does not help: the next round of updates still fails.
Removing the label and image on both the SystemController and the subcloud, then locking and unlocking the subcloud, does not help either: the next round of updates still fails.

Severity
--------
Major

Steps to Reproduce
------------------
Precondition: the subcloud2 FPGA has been flashed with a root-key image.
Run "dcmanager fw-update-strategy apply" to update the subcloud2 FPGA with an unsigned image.

Expected Behavior
------------------
The subcloud2 FPGA update fails and the fw strategy fails quickly. "dcmanager fw-update-strategy show" reports the failure as "finishing fw update: Not all images applied successfully". (observed on 2020-07-31_20-00-00)
After the label and image are removed, another round of updates should work. (observed on 2020-07-31_20-00-00)

Actual Behavior
----------------
The subcloud2 FPGA update fails, but the fw strategy does not fail until about an hour later. "dcmanager fw-update-strategy show" reports the failure as "applying fw update strategy: Timeout applying firmware strategy."
After the label and image are removed, every further round of updates fails with "creating fw update strategy: VIM strategy unexpected build state: applying".

Reproducibility
---------------
yes

System Configuration
--------------------
Lab-name: DC-3

Branch/Pull Time/Commit
-----------------------
2020-08-07_20-00-00

Last Pass
---------
2020-07-31_20-00-00

Timestamp/Logs
--------------
### first round update

[sysadmin@controller-0 ~(keystone_admin)]$ dcmanager strategy-step list
+-----------+-------+--------+------------------------------------------------------------------+----------------------------+----------------------------+
| cloud | stage | state | details | started_at | finished_at |
+-----------+-------+--------+------------------------------------------------------------------+----------------------------+----------------------------+
| subcloud2 | 2 | failed | applying fw update strategy: Timeout applying firmware strategy. | 2020-08-08 13:57:43.558955 | 2020-08-08 14:58:42.410606 |
+-----------+-------+--------+------------------------------------------------------------------+----------------------------+----------------------------+

[sysadmin@controller-0 ~(keystone_admin)]$ system --os-auth-url https://[fd01:13::2]:5001/v3 --os-region-name subcloud2 device-image-list
+--------------------------------------+----------------+------------+------------+--------------+---------------+---------------+------+-------------+---------------+---------+-------------------------+
| uuid | bitstream_type | pci_vendor | pci_device | bitstream_id | key_signature | revoke_key_id | name | description | image_version | applied | applied_labels |
+--------------------------------------+----------------+------------+------------+--------------+---------------+---------------+------+-------------+---------------+---------+-------------------------+
| 2960a627-41fb-42c2-8e55-daea2817df3d | functional | 8086 | 0b30 | 1 | None | None | None | None | None | True | [{u'subcloud2': u'28'}] |
+--------------------------------------+----------------+------------+------------+--------------+---------------+---------------+------+-------------+---------------+---------+-------------------------+

[sysadmin@controller-0 ~(keystone_admin)]$ system --os-auth-url https://[fd01:13::2]:5001/v3 --os-region-name subcloud2 device-image-state-list
+--------------+--------------------+--------------------------------------+--------+----------------------------------+----------------------------------+
| hostname | PCI device address | Device image uuid | status | Update start time | updated_at |
+--------------+--------------------+--------------------------------------+--------+----------------------------------+----------------------------------+
| controller-0 | 0000:b2:00.0 | 2960a627-41fb-42c2-8e55-daea2817df3d | failed | 2020-08-08T13:58:14.621556+00:00 | 2020-08-08T14:13:28.650394+00:00 |
+--------------+--------------------+--------------------------------------+--------+----------------------------------+----------------------------------+

[sysadmin@controller-0 ~(keystone_admin)]$ fm --os-auth-url https://[fd01:13::2]:5001/v3 --os-region-name subcloud2 alarm-list
+----------+-------------------------------------------+---------------------------------------------+----------+----------------------------+
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+----------+-------------------------------------------+---------------------------------------------+----------+----------------------------+
| 900.301 | Firmware update auto-apply inprogress | orchestration=fw-update | major | 2020-08-08T13:58:14.298101 |
| 900.006 | Device image update operation in progress | system=49038d7a-59c7-462b-a109-3f6e3559a58d | minor | 2020-08-08T13:57:46.679773 |

### After remove label and image on SystemController, second round update

[sysadmin@controller-0 ~(keystone_admin)]$ dcmanager strategy-step list
+-----------+-------+--------+----------------------------------------------------------------------------+----------------------------+----------------------------+
| cloud | stage | state | details | started_at | finished_at |
+-----------+-------+--------+----------------------------------------------------------------------------+----------------------------+----------------------------+
| subcloud2 | 2 | failed | creating fw update strategy: VIM strategy unexpected build state: applying | 2020-08-08 17:13:31.668916 | 2020-08-08 17:13:52.868613 |
+-----------+-------+--------+----------------------------------------------------------------------------+----------------------------+----------------------------+

[sysadmin@controller-0 ~(keystone_admin)]$ system --os-auth-url https://[fd01:13::2]:5001/v3 --os-region-name subcloud2 device-image-list
+--------------------------------------+----------------+------------+------------+--------------+---------------+---------------+------+-------------+---------------+---------+-------------------------+
| uuid | bitstream_type | pci_vendor | pci_device | bitstream_id | key_signature | revoke_key_id | name | description | image_version | applied | applied_labels |
+--------------------------------------+----------------+------------+------------+--------------+---------------+---------------+------+-------------+---------------+---------+-------------------------+
| 2960a627-41fb-42c2-8e55-daea2817df3d | functional | 8086 | 0b30 | 1 | None | None | None | None | None | False | None |
| 35980675-5836-451c-8bbc-d224390cf526 | functional | 8086 | 0b30 | 2 | None | None | None | None | None | True | [{u'subcloud2': u'28'}] |
+--------------------------------------+----------------+------------+------------+--------------+---------------+---------------+------+-------------+---------------+---------+-------------------------+

[sysadmin@controller-0 ~(keystone_admin)]$ system --os-auth-url https://[fd01:13::2]:5001/v3 --os-region-name subcloud2 device-image-state-list
+--------------+--------------+--------------------------------------+---------+-------------------+------------+
| hostname | PCI device | Device image uuid | status | Update start time | updated_at |
| | address | | | | |
+--------------+--------------+--------------------------------------+---------+-------------------+------------+
| controller-0 | 0000:b2:00.0 | 35980675-5836-451c-8bbc-d224390cf526 | pending | None | None |
+--------------+--------------+--------------------------------------+---------+-------------------+------------+

[sysadmin@controller-0 ~(keystone_admin)]$ fm --os-auth-url https://[fd01:13::2]:5001/v3 --os-region-name subcloud2 alarm-list --mgmt_affecting
+----------+-------------------------------------------+---------------------------------------------+----------+----------------------+----------------------------+
| Alarm ID | Reason Text | Entity ID | Severity | Management Affecting | Time Stamp |
+----------+-------------------------------------------+---------------------------------------------+----------+----------------------+----------------------------+
| 900.301 | Firmware update auto-apply inprogress | orchestration=fw-update | major | True | 2020-08-08T17:22:00.750332 |
| 900.006 | Device image update operation in progress | system=49038d7a-59c7-462b-a109-3f6e3559a58d | minor | True | 2020-08-08T17:13:35.644574 |

Test Activity
-------------
Functional Testing

Difu Hu (difuhu) wrote :
Difu Hu (difuhu)
summary: - "dcmanager fw-update-strategy apply" update a subcloud failed, the
+ "dcmanager fw-update-strategy apply" updates a subcloud failed, the
subcloud can never be updated again
description: updated
Difu Hu (difuhu)
description: updated
Difu Hu (difuhu)
description: updated
Ghada Khalil (gkhalil)
tags: added: stx.5.0
Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
Eric MacDonald (rocksolidmtce) wrote :

VIM does not currently support a host-level device_image_update state transition from 'in-progress' to 'pending', which is what happened when the lower-level code aborted the update.

VIM will be modified to treat an 'in-progress' to 'pending' transition as an fwupdate FAILURE at both the 'host' and overall 'strategy' levels.

VIM will also be modified to gracefully handle the device image update abort exception raised when an abort is requested with nothing to abort. This exception case has been reproduced.
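
For readers following along, the handling described above can be sketched roughly as below. This is an illustrative sketch only, not the actual nfv/VIM code: the names `ImageUpdateStatus`, `NothingToAbort`, `classify_transition`, `strategy_result` and `safe_abort` are all hypothetical, and the 'completed' status is an assumption (only 'pending', 'in-progress' and 'failed' appear in this report).

```python
from enum import Enum


class ImageUpdateStatus(Enum):
    """Host-level device image update statuses (hypothetical enum)."""
    PENDING = "pending"
    IN_PROGRESS = "in-progress"
    COMPLETED = "completed"  # assumed name, not taken from this report
    FAILED = "failed"


class NothingToAbort(Exception):
    """Hypothetical error raised when an abort finds no update in flight."""


def classify_transition(previous, current):
    """Classify one host's status change as 'failed', 'done' or 'waiting'."""
    if current is ImageUpdateStatus.FAILED:
        return "failed"
    if (previous is ImageUpdateStatus.IN_PROGRESS
            and current is ImageUpdateStatus.PENDING):
        # The lower-level code aborted the update, so the host dropped
        # back to 'pending'. Treat it as a failure rather than waiting
        # until the strategy times out.
        return "failed"
    if current is ImageUpdateStatus.COMPLETED:
        return "done"
    return "waiting"


def strategy_result(host_results):
    """Roll per-host results up to the overall strategy level."""
    results = list(host_results.values())
    if "failed" in results:
        return "failed"
    if results and all(r == "done" for r in results):
        return "done"
    return "waiting"


def safe_abort(abort_fn):
    """Call abort_fn; return False instead of raising if nothing to abort."""
    try:
        abort_fn()
        return True
    except NothingToAbort:
        return False
```

With this shape, a host that falls back from 'in-progress' to 'pending' fails the strategy immediately, so a later strategy create would not find the previous one still in the 'applying' state.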

Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → High
status: New → Triaged
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nfv (master)

Fix proposed to branch: master
Review: https://review.opendev.org/745759

Changed in starlingx:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to nfv (master)

Reviewed: https://review.opendev.org/745759
Committed: https://git.openstack.org/cgit/starlingx/nfv/commit/?id=1deaa07313f028c8837c428b536918c9d18d22be
Submitter: Zuul
Branch: master

commit 1deaa07313f028c8837c428b536918c9d18d22be
Author: Eric MacDonald <email address hidden>
Date: Tue Aug 11 19:16:19 2020 -0400

    Add in-progress to pending status change handling to VIM fwupdate orch

    A host level device_image_update status, during fwupdate, can change
    from 'in-progress' to 'pending' as a result of an update failure.

    This update adds this failure handling state transition case to the
    VIM's fwupdate orchestration service.

    Change-Id: If7492bb11a0452330652b833e952754399a47d8f
    Closes-Bug: 1890915
    Signed-off-by: Eric MacDonald <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
tags: added: stx.nfv
Difu Hu (difuhu) wrote :

Verified on build 2020-06-27_18-35-20 with PATCH_0002

tags: removed: stx.retestneeded