subcloud3 and subcloud4 kept out-of-sync after "dcmanager fw-update-strategy apply"

Bug #1890295 reported by Difu Hu
This bug affects 1 person
Affects: StarlingX
Status: Triaged
Importance: Low
Assigned to: Ghada Khalil
Milestone: (none listed)

Bug Description

Brief Description
-----------------
subcloud3 and subcloud4 remained out-of-sync for more than 2 hours after "dcmanager fw-update-strategy apply" completed.
subcloud1 and subcloud2 became in-sync as expected.

Severity
--------
Major

Steps to Reproduce
------------------
system --os-region-name SystemController device-image-apply 63d85a33-531f-4714-afc9-555f39f6d62f subcloud=abc subcloud3=3 subcloud4=4
dcmanager fw-update-strategy create --subcloud-apply-type serial
dcmanager fw-update-strategy apply

Expected Behavior
------------------
All 4 subclouds complete the update and report in-sync.

Actual Behavior
----------------
All 4 subclouds complete the update, but subcloud3 and subcloud4 remain out-of-sync.

Reproducibility
---------------
not sure

System Configuration
--------------------
Lab-name: DC-3

Branch/Pull Time/Commit
-----------------------
2020-07-31_20-00-00

Last Pass
---------
N/A

Timestamp/Logs
--------------
[sysadmin@controller-0 ~(keystone_admin)]$ system --os-region-name SystemController device-image-apply 63d85a33-531f-4714-afc9-555f39f6d62f subcloud=abc subcloud3=3 subcloud4=4
+----------------+---------------------------------------------------------------------+
| Property | Value |
+----------------+---------------------------------------------------------------------+
| uuid | 63d85a33-531f-4714-afc9-555f39f6d62f |
| bitstream_type | functional |
| pci_vendor | 8086 |
| pci_device | 0b30 |
| bitstream_id | 11 |
| key_signature | None |
| revoke_key_id | None |
| name | None |
| description | None |
| image_version | None |
| applied | True |
| applied_labels | [{u'subcloud': u'abc'}, {u'subcloud3': u'3'}, {u'subcloud4': u'4'}] |
+----------------+---------------------------------------------------------------------+

[sysadmin@controller-0 ~(keystone_admin)]$ dcmanager fw-update-strategy create --subcloud-apply-type serial
+------------------------+----------------------------+
| Field | Value |
+------------------------+----------------------------+
| strategy type | firmware |
| subcloud apply type | serial |
| max parallel subclouds | None |
| stop on failure | False |
| state | initial |
| created_at | 2020-08-03T20:26:14.214028 |
| updated_at | None |
+------------------------+----------------------------+

[sysadmin@controller-0 ~(keystone_admin)]$ dcmanager fw-update-strategy apply
+------------------------+----------------------------+
| Field | Value |
+------------------------+----------------------------+
| strategy type | firmware |
| subcloud apply type | serial |
| max parallel subclouds | None |
| stop on failure | False |
| state | applying |
| created_at | 2020-08-03T20:26:14.214028 |
| updated_at | 2020-08-03T20:26:53.827015 |
+------------------------+----------------------------+

[sysadmin@controller-0 ~(keystone_admin)]$ dcmanager strategy-step list
+-----------+-------+-----------------------------+-----------------------------+----------------------------+----------------------------+
| cloud | stage | state | details | started_at | finished_at |
+-----------+-------+-----------------------------+-----------------------------+----------------------------+----------------------------+
| subcloud1 | 2 | complete | | 2020-08-03 20:26:56.114160 | 2020-08-03 21:12:30.289524 |
| subcloud2 | 3 | complete | | 2020-08-03 21:12:38.813454 | 2020-08-03 21:58:02.459198 |
| subcloud3 | 4 | applying fw update strategy | apply phase is 16% complete | 2020-08-03 21:58:11.055072 | None |
| subcloud4 | 5 | initial | | None | None |
+-----------+-------+-----------------------------+-----------------------------+----------------------------+----------------------------+
[sysadmin@controller-0 ~(keystone_admin)]$ dcmanager strategy-step list
+-----------+-------+----------+---------+----------------------------+----------------------------+
| cloud | stage | state | details | started_at | finished_at |
+-----------+-------+----------+---------+----------------------------+----------------------------+
| subcloud1 | 2 | complete | | 2020-08-03 20:26:56.114160 | 2020-08-03 21:12:30.289524 |
| subcloud2 | 3 | complete | | 2020-08-03 21:12:38.813454 | 2020-08-03 21:58:02.459198 |
| subcloud3 | 4 | complete | | 2020-08-03 21:58:11.055072 | 2020-08-03 22:42:44.662378 |
| subcloud4 | 5 | complete | | 2020-08-03 22:42:53.327048 | 2020-08-03 23:28:47.047164 |
+-----------+-------+----------+---------+----------------------------+----------------------------+

[sysadmin@controller-0 ~(keystone_admin)]$ dcmanager subcloud list
+----+-----------+------------+--------------+---------------+-------------+
| id | name | management | availability | deploy status | sync |
+----+-----------+------------+--------------+---------------+-------------+
| 1 | subcloud1 | managed | online | complete | in-sync |
| 2 | subcloud2 | managed | online | complete | in-sync |
| 3 | subcloud3 | managed | online | complete | out-of-sync |
| 4 | subcloud4 | managed | online | complete | out-of-sync |
+----+-----------+------------+--------------+---------------+-------------+

[sysadmin@controller-0 ~(keystone_admin)]$ date
Tue Aug 4 01:24:50 UTC 2020
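
As a rough cross-check of the "more than 2 hours" figure, the sketch below subtracts each finished_at value from the strategy-step list from the `date` output above. It assumes the strategy-step timestamps are also UTC, which the report does not state. Under that assumption subcloud3 had been out-of-sync for roughly 2 h 42 min at the time of the check, and subcloud4 for just under 2 h.

```python
from datetime import datetime

# Time of the `date` command in the log above (UTC).
checked_at = datetime(2020, 8, 4, 1, 24, 50)

# finished_at values copied from the second `dcmanager strategy-step list`.
# ASSUMPTION: these timestamps are UTC as well.
FMT = "%Y-%m-%d %H:%M:%S.%f"
finished_at = {
    "subcloud3": datetime.strptime("2020-08-03 22:42:44.662378", FMT),
    "subcloud4": datetime.strptime("2020-08-03 23:28:47.047164", FMT),
}

# Time each subcloud had stayed out-of-sync after its strategy step completed.
elapsed = {name: checked_at - done for name, done in finished_at.items()}

for name, delta in elapsed.items():
    print(name, "out-of-sync for", delta)
```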

[sysadmin@controller-0 ~(keystone_admin)]$ dcmanager subcloud show subcloud3
+-----------------------------+----------------------------+
| Field | Value |
+-----------------------------+----------------------------+
| id | 3 |
| name | subcloud3 |
| description | None |
| location | None |
| software_version | 20.06 |
| management | managed |
| availability | online |
| deploy_status | complete |
| management_subnet | fd01:303::0/64 |
| management_start_ip | fd01:303::2 |
| management_end_ip | fd01:303::11 |
| management_gateway_ip | fd01:303::1 |
| systemcontroller_gateway_ip | fd01:11::1 |
| group_id | 1 |
| created_at | 2020-08-02 03:06:47.540678 |
| updated_at | 2020-08-03 22:40:04.965372 |
| dc-cert_sync_status | in-sync |
| firmware_sync_status | out-of-sync |
| identity_sync_status | in-sync |
| load_sync_status | in-sync |
| patching_sync_status | in-sync |
| platform_sync_status | in-sync |
+-----------------------------+----------------------------+

[sysadmin@controller-0 ~(keystone_admin)]$ dcmanager subcloud show subcloud4
+-----------------------------+----------------------------+
| Field | Value |
+-----------------------------+----------------------------+
| id | 4 |
| name | subcloud4 |
| description | None |
| location | None |
| software_version | 20.06 |
| management | managed |
| availability | online |
| deploy_status | complete |
| management_subnet | fd01:304::0/64 |
| management_start_ip | fd01:304::2 |
| management_end_ip | fd01:304::11 |
| management_gateway_ip | fd01:304::1 |
| systemcontroller_gateway_ip | fd01:11::1 |
| group_id | 1 |
| created_at | 2020-08-02 03:07:00.766138 |
| updated_at | 2020-08-03 23:26:09.679729 |
| dc-cert_sync_status | in-sync |
| firmware_sync_status | out-of-sync |
| identity_sync_status | in-sync |
| load_sync_status | in-sync |
| patching_sync_status | in-sync |
| platform_sync_status | in-sync |
+-----------------------------+----------------------------+

Test Activity
-------------
Functional Testing

Revision history for this message
Difu Hu (difuhu) wrote :
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Jessica Castelino (jcasteli)
Ghada Khalil (gkhalil) wrote :

Marking as stx.5.0 - issue related to DC FPGA Orchestration feature

tags: added: stx.5.0 stx.distcloud stx.fpga
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → High
status: New → Triaged
Ghada Khalil (gkhalil) wrote :

The issue was reproduced on 2020-08-05 - subcloud1 & subcloud2 were in-sync whereas subcloud4 was out-of-sync even though the device image state shows completed.

As per Difu, he was doing 2 rounds of orchestration: the first round passed on all subclouds, then the second round left subcloud4 out-of-sync.

As per Al, subcloud4 is out-of-sync because there are 2 applied images on the system controller, but only one is applied on subcloud4.

On the system controller, there are 2 images applied:

system device-image-list
+--------------------------------------+----------------+------------+------------+--------------+---------------+---------------+------+-------------+---------------+---------+---------------------------------------------------------------------+
| uuid                                 | bitstream_type | pci_vendor | pci_device | bitstream_id | key_signature | revoke_key_id | name | description | image_version | applied | applied_labels                                                      |
+--------------------------------------+----------------+------------+------------+--------------+---------------+---------------+------+-------------+---------------+---------+---------------------------------------------------------------------+
| cbe0f5f7-f1d6-4b1b-8153-28951991d4ee | functional     | 8086       | 0b30       | 11           | None          | None          | None | None        | None          | True    | [{u'subcloud': u'abc'}, {u'subcloud3': u'3'}, {u'subcloud4': u'4'}] |
| ee659fb7-d433-4937-bb5b-f213185b07b5 | functional     | 8086       | 0b30       | 2            | None          | None          | None | None        | None          | True    | [{u'subcloud': u'abc'}, {u'subcloud3': u'3'}, {u'subcloud4': u'4'}] |
+--------------------------------------+----------------+------------+------------+--------------+---------------+---------------+------+-------------+---------------+---------+---------------------------------------------------------------------+

On subcloud4, there is only 1 image applied:

system device-image-list
+--------------------------------------+----------------+------------+------------+--------------+---------------+---------------+------+-------------+---------------+---------+------------------------+
| uuid                                 | bitstream_type | pci_vendor | pci_device | bitstream_id | key_signature | revoke_key_id | name | description | image_version | applied | applied_labels         |
+--------------------------------------+----------------+------------+------------+--------------+---------------+---------------+------+-------------+---------------+---------+------------------------+
| cbe0f5f7-f1d6-4b1b-8153-28951991d4ee | functional     | 8086       | 0b30       | 11           | None          | None          | None | None        | None          | False   | None                   |
| ee659fb7-d433-4937-bb5b-f213185b07b5 | functional     | 8086       | 0b30       | 2            | None          | None          | None | None        | None          | True    | [{u'subcloud4': u'4'}] |
+--------------------------------------+----------------+------------+------------+--------------+---------------+---------------+------+-------------+---------------+---------+------------------------+
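
The mismatch between the two listings can be illustrated with a small comparison. This is only a sketch of the reasoning in this comment, not the dcmanager firmware audit code: the function names `label_matches` and `missing_on_subcloud` are hypothetical, and the data is copied from the two device-image-list outputs and the subcloud4 device label assignment (subcloud4=4) described in this report.

```python
# Applied-image view on the system controller (only the fields used here).
controller_images = [
    {"uuid": "cbe0f5f7-f1d6-4b1b-8153-28951991d4ee", "bitstream_id": "11",
     "applied": True,
     "applied_labels": [{"subcloud": "abc"}, {"subcloud3": "3"}, {"subcloud4": "4"}]},
    {"uuid": "ee659fb7-d433-4937-bb5b-f213185b07b5", "bitstream_id": "2",
     "applied": True,
     "applied_labels": [{"subcloud": "abc"}, {"subcloud3": "3"}, {"subcloud4": "4"}]},
]

# Applied-image view on subcloud4.
subcloud4_images = [
    {"uuid": "cbe0f5f7-f1d6-4b1b-8153-28951991d4ee", "applied": False,
     "applied_labels": None},
    {"uuid": "ee659fb7-d433-4937-bb5b-f213185b07b5", "applied": True,
     "applied_labels": [{"subcloud4": "4"}]},
]

# Label assigned to the device on subcloud4 in this report.
subcloud4_device_labels = {"subcloud4": "4"}


def label_matches(image, device_labels):
    """Hypothetical helper: True if any applied label on the image matches a device label."""
    for label in image["applied_labels"] or []:
        for key, value in label.items():
            if device_labels.get(key) == value:
                return True
    return False


def missing_on_subcloud(controller, subcloud, device_labels):
    """Hypothetical helper: uuids applied on the controller for this subcloud's
    labels that the subcloud itself does not report as applied."""
    applied_here = {img["uuid"] for img in subcloud if img["applied"]}
    return [
        img["uuid"]
        for img in controller
        if img["applied"]
        and label_matches(img, device_labels)
        and img["uuid"] not in applied_here
    ]


missing = missing_on_subcloud(controller_images, subcloud4_images, subcloud4_device_labels)
print(missing)
```

With the data above, the only uuid reported is the older bitstream_id 11 image: the system controller still lists it as applied, while subcloud4 shows only the bitstream_id 2 image as applied.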

The detailed sequence of events is as follows:

subcloud1: system host-device-label-assign controller-0 0000:b2:00.0 subcloud=abc
subcloud2: system host-device-label-assign controller-0 0000:b2:00.0 subcloud=abc
subcloud3: system host-device-label-assign controller-0 0000:b4:00.0 subcloud3=3
subcloud4: system host-device-label-assign controller-0 0000:b4:00.0 subcloud4=4
system --os-region-name SystemController device-image-upload 5gldpc_1x2x25g_20ww2.3_swap_ddr4_2xrefresh-signed-ssl-csk1.bin functional 8086 0b30 --bitstream-id 11
system --os-region-name SystemController device-image-apply cbe0f5f7-f1d6-4b1b-8153-28951991d4ee subcloud=abc subcloud3=3 subcloud4=4
dcmanager fw-update...


Ghada Khalil (gkhalil) wrote :

@Difu, please try the test-case again, but remove the old functional image before applying the new one and confirm that the subclouds all go in-sync.

Difu Hu (difuhu) wrote :

Tested on build 2020-08-06_20-00-00.

After the first round of "dcmanager fw-update-strategy apply", I removed the label from the SystemController image with "system --os-region-name SystemController device-image-remove 997d9e44-42e0-4608-b7b6-0f0d37ac6d00 subcloud=t24".

I then uploaded another image, applied the label, and ran a second round of "dcmanager fw-update-strategy apply".
The issue didn't occur.

tags: removed: stx.retestneeded
Ghada Khalil (gkhalil) wrote :

Lowering the priority as the subclouds remain in-sync with an altered procedure. Will leave open to discuss with Chris Friesen whether we should implement a longer-term solution as per the comments above:

Longer term, we should look at a solution where the system controller doesn't report both images as applied. This needs further investigation as the system controller does not have device image state records, so it does not know to replace the old image with the new image.

Changed in starlingx:
importance: High → Low
assignee: Jessica Castelino (jcasteli) → Ghada Khalil (gkhalil)
Ghada Khalil (gkhalil) wrote :

Given this is low priority, it will not hold up stx.5.0

tags: removed: stx.5.0