Distributed Cloud IPv6: subcloud patch auto-apply failed

Bug #1856226 reported by Peng Peng
This bug affects 1 person
Affects: StarlingX
Status: Invalid
Importance: Medium
Assigned to: Al Bailey

Bug Description

Brief Description
-----------------
Applying a reboot-required patch to a Distributed Cloud (DC) system succeeded on three subclouds but failed on one. Alarm 900.103 "Software patch auto-apply failed" was raised on the failed subcloud.

Severity
--------
Major

Steps to Reproduce
------------------
1. apply/upload PATCH.ENABLE_DEV_CERTIFICATE-19.12 to system controller
2. dcmanager patch-strategy create --subcloud-apply-type parallel --max-parallel-subclouds 10
3. dcmanager patch-strategy apply
4. after patch apply success, dcmanager patch-strategy delete
5. upload/apply 2019-12-08_20-00-00_RR_ALLNODES.patch to system controller
6. dcmanager patch-strategy create --subcloud-apply-type parallel --max-parallel-subclouds 10
7. dcmanager patch-strategy apply
8. dcmanager strategy-step list
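
The dcmanager part of the steps above (steps 2-4 and 6-8) can be sketched as a small Python driver. This is only an illustration: the command strings are copied from the steps, while `run_cli`, `orchestrate`, and the list names are hypothetical helpers. Patch upload/apply (steps 1 and 5) is left out because the report does not give the exact commands, and the sketch does not wait for the strategy to finish before moving to the next command, which a real test would need to do.

```python
import shlex
import subprocess

# Command strings copied verbatim from the reproduction steps.
CREATE = ("dcmanager patch-strategy create "
          "--subcloud-apply-type parallel --max-parallel-subclouds 10")
APPLY = "dcmanager patch-strategy apply"
DELETE = "dcmanager patch-strategy delete"
LIST_STEPS = "dcmanager strategy-step list"

FIRST_PATCH = [CREATE, APPLY, DELETE]       # steps 2-4
SECOND_PATCH = [CREATE, APPLY, LIST_STEPS]  # steps 6-8


def run_cli(command):
    """Run one CLI command and return its stdout (raises on non-zero exit)."""
    result = subprocess.run(shlex.split(command), check=True,
                            capture_output=True, text=True)
    return result.stdout


def orchestrate(commands, runner=run_cli):
    """Run the commands in order and return (command, output) pairs.

    The runner is injectable so the sequence can be dry-run without a lab.
    """
    return [(cmd, runner(cmd)) for cmd in commands]
```

A dry run such as `orchestrate(SECOND_PATCH, runner=print)` just prints the commands in order without contacting the system controller.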

TC-name: RR patching test

Expected Behavior
------------------
Patching succeeds on all subclouds

Actual Behavior
----------------
Patching failed on subcloud1

Reproducibility
---------------
Unknown - first time this is seen in this load

System Configuration
--------------------
DC system
IPv6

Lab-name: DC

Branch/Pull Time/Commit
-----------------------
2019-12-08_20-00-00

Last Pass
---------
2019-10-06 load

Timestamp/Logs
--------------
[sysadmin@controller-0 ~(keystone_admin)]$ dcmanager strategy-step list
+------------------+-------+----------+----------------------------------------------------------------+----------------------------+----------------------------+
| cloud | stage | state | details | started_at | finished_at |
+------------------+-------+----------+----------------------------------------------------------------+----------------------------+----------------------------+
| SystemController | 1 | complete | | 2019-12-12 16:08:43.254742 | 2019-12-12 16:37:27.462675 |
| subcloud1 | 2 | failed | Strategy apply failed for subcloud1 - unexpected state aborted | 2019-12-12 16:37:37.469304 | 2019-12-12 16:41:31.093920 |
| subcloud4 | 2 | complete | | 2019-12-12 16:37:37.480419 | 2019-12-12 17:04:42.760715 |
| subcloud5 | 2 | complete | | 2019-12-12 16:37:37.494256 | 2019-12-12 17:05:52.441619 |
| subcloud6 | 2 | complete | | 2019-12-12 16:37:37.510882 | 2019-12-12 16:50:22.378693 |
+------------------+-------+----------+----------------------------------------------------------------+----------------------------+----------------------------+
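
For automated checks, output like the table above can be turned into records and filtered for failed steps. The following is a minimal sketch that assumes the bordered table layout shown in this report; `parse_table` and `failed_steps` are hypothetical helper names, not part of dcmanager, and the sample rows are copied from the output above.

```python
SAMPLE = """\
| cloud | stage | state | details | started_at | finished_at |
| SystemController | 1 | complete | | 2019-12-12 16:08:43.254742 | 2019-12-12 16:37:27.462675 |
| subcloud1 | 2 | failed | Strategy apply failed for subcloud1 - unexpected state aborted | 2019-12-12 16:37:37.469304 | 2019-12-12 16:41:31.093920 |
| subcloud4 | 2 | complete | | 2019-12-12 16:37:37.480419 | 2019-12-12 17:04:42.760715 |
"""


def parse_table(text):
    """Parse a '|'-delimited CLI table into a list of dicts keyed by header."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        # Skip '+---+' border lines, prompts, and anything else that is not a row.
        if not (line.startswith("|") and line.endswith("|")):
            continue
        rows.append([cell.strip() for cell in line[1:-1].split("|")])
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


def failed_steps(rows):
    """Return only the strategy steps whose state is 'failed'."""
    return [row for row in rows if row.get("state") == "failed"]


for step in failed_steps(parse_table(SAMPLE)):
    print(step["cloud"], "-", step["details"])
```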

Subcloud1:
[sysadmin@controller-0 ~(keystone_admin)]$ fm alarm-list
+----------+----------------------------------------+------------------------+----------+-------------------+
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+----------+----------------------------------------+------------------------+----------+-------------------+
| 900.103 | Software patch auto-apply failed | orchestration=sw-patch | critical | 2019-12-12T16:40: |
| | | | | 51.149126 |
| | | | | |
| 900.001 | Patching operation in progress | host=controller | minor | 2019-12-12T16:37: |
| | | | | 54.527125 |
| | | | | |
| 500.101 | Developer patch certificate is enabled | host=controller | critical | 2019-12-12T15:53: |
| | | | | 35.613420 |
| | | | | |
+----------+----------------------------------------+------------------------+----------+-------------------+
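
The alarm table wraps long timestamps onto a second row, which makes it awkward to grep. A rough sketch of folding those continuation rows back into one record is shown below. It assumes a row with an empty first cell continues the previous alarm and that wrapped fragments join without a space (true for the timestamps here, but not necessarily for wrapped text cells). `parse_wrapped_table` is a hypothetical helper, not an fm CLI feature.

```python
ALARMS = """\
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
| 900.103 | Software patch auto-apply failed | orchestration=sw-patch | critical | 2019-12-12T16:40: |
| | | | | 51.149126 |
| | | | | |
| 900.001 | Patching operation in progress | host=controller | minor | 2019-12-12T16:37: |
| | | | | 54.527125 |
| | | | | |
| 500.101 | Developer patch certificate is enabled | host=controller | critical | 2019-12-12T15:53: |
| | | | | 35.613420 |
| | | | | |
"""


def parse_wrapped_table(text):
    """Parse a CLI table whose long cells wrap onto continuation rows."""
    header, records = None, []
    for line in text.splitlines():
        line = line.strip()
        if not (line.startswith("|") and line.endswith("|")):
            continue  # border line or unrelated output
        cells = [cell.strip() for cell in line[1:-1].split("|")]
        if header is None:
            header = cells
        elif cells[0]:
            records.append(cells)  # a new alarm starts here
        elif records and any(cells):
            # Continuation row: glue each fragment onto the previous record.
            records[-1] = [prev + part for prev, part in zip(records[-1], cells)]
        # Rows with every cell empty are spacers and are ignored.
    return [dict(zip(header, record)) for record in records]


alarms = parse_wrapped_table(ALARMS)
critical = [a["Alarm ID"] for a in alarms if a["Severity"] == "critical"]
```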

Test Activity
-------------
Regression Testing

Ghada Khalil (gkhalil)
tags: added: stx.distcloud
Yang Liu (yliu12)
description: updated
tags: added: stx.retestneeded
Ghada Khalil (gkhalil) wrote :

stx.4.0 / medium priority - Affects a specific config. This appears to affect standard subclouds only. AIO-DX and AIO-SX subclouds don't have this issue.

tags: added: stx.4.0
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: nobody → Al Bailey (albailey1974)
Peng Peng (ppeng) wrote :

2019-12-13_19-03-42_RR_ALLNODES.patch
BUILD_ID="2019-12-13_19-03-42"

Passed, including on the same subcloud1

Al Bailey (albailey1974) wrote :

This is the error that coincides with the strategy failure

nfv-vim.log:2019-12-12T16:40:51.136 controller-0 VIM_Thread[96683] ERROR Caught exception while trying to lock a host controller-1, error=[OpenStack Rest-API Exception: method=PATCH, url=http://[fd01:2::2]:6385/v1/ihosts/15909817-2150-489e-a477-b76125104d40, headers={'Content-Type': 'application/json', 'User-Agent': 'vim/1.0'}, body=[{"path": "/action", "value": "lock", "op": "replace"}], status_code=400, reason=HTTP Error 400: Bad Request, response_headers=[('Date', 'Thu, 12 Dec 2019 16:40:51 GMT'), ('Content-Length', '251'), ('Content-Type', 'application/json')], response_body={"error_message": "{\"debuginfo\": null, \"faultcode\": \"Client\", \"faultstring\": \"Only 2 storage monitor available. At least 2 unlocked and enabled hosts with monitors are required. Please ensure hosts with monitors are unlocked and enabled.\"}"}].
nfv-vim.log:OpenStackRestAPIException: [OpenStack Rest-API Exception: method=PATCH, url=http://[fd01:2::2]:6385/v1/ihosts/15909817-2150-489e-a477-b76125104d40, headers={'Content-Type': 'application/json', 'User-Agent': 'vim/1.0'}, body=[{"path": "/action", "value": "lock", "op": "replace"}], status_code=400, reason=HTTP Error 400: Bad Request, response_headers=[('Date', 'Thu, 12 Dec 2019 16:40:51 GMT'), ('Content-Length', '251'), ('Content-Type', 'application/json')], response_body={"error_message": "{\"debuginfo\": null, \"faultcode\": \"Client\", \"faultstring\": \"Only 2 storage monitor available. At least 2 unlocked and enabled hosts with monitors are required. Please ensure hosts with monitors are unlocked and enabled.\"}"}]
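
The useful part of this log line is the `faultstring`, which is JSON nested inside the `error_message` string of the response body. A minimal sketch of pulling it out follows. It assumes the `response_body={...}]` layout seen in the line above; `extract_fault` is a hypothetical helper, and the excerpt is trimmed to the status and response body for brevity.

```python
import json
import re

# Trimmed excerpt of the nfv-vim.log line above (request details and
# response headers omitted).
LOG_EXCERPT = (
    r'status_code=400, reason=HTTP Error 400: Bad Request, '
    r'response_body={"error_message": "{\"debuginfo\": null, '
    r'\"faultcode\": \"Client\", \"faultstring\": \"Only 2 storage monitor '
    r'available. At least 2 unlocked and enabled hosts with monitors are '
    r'required. Please ensure hosts with monitors are unlocked and enabled.\"}"}].'
)


def extract_fault(line):
    """Return the decoded fault dict from a VIM REST exception log line, or None."""
    match = re.search(r'response_body=(\{.*\})\]', line)
    if match is None:
        return None
    outer = json.loads(match.group(1))        # {"error_message": "<JSON text>"}
    return json.loads(outer["error_message"])  # the nested fault document


fault = extract_fault(LOG_EXCERPT)
print(fault["faultcode"], "-", fault["faultstring"])
```

Decoding twice is needed because the API returns the fault as a JSON document serialized into a string field rather than as a nested object.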

Al Bailey (albailey1974) wrote :

Note: There are no compute-0 logs collected as part of the SubCloud collect

Al Bailey (albailey1974) wrote :

Marking as Invalid, since recent loads do not reproduce this.

Changed in starlingx:
status: Triaged → Invalid
Peng Peng (ppeng) wrote :

The issue has not been reproduced recently. Closing it for now.

tags: removed: stx.retestneeded