patch orch failed on sx subcloud with oidc and stx-monitor applied - host not unlock

Bug #1876500 reported by Peng Peng
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Don Penney

Bug Description

Brief Description
-----------------
With the oidc and stx-monitor apps applied on a Distributed Cloud (DC) system, patch orchestration was used to apply the LARGE patch across the DC. The patch apply failed on one SX subcloud, and its host was left locked.

Severity
--------
Major

Steps to Reproduce
------------------
Apply the oidc and stx-monitor apps on the DC system
Upload and apply the LARGE patch, then create a patch strategy for it
Apply the strategy

Expected Behavior
------------------
Patching succeeds on all subclouds

Actual Behavior
----------------
Patching failed on one SX subcloud (subcloud6); its controller-0 remained locked

Reproducibility
---------------
Unknown. This is the first time it has been seen in sanity; will monitor.

System Configuration
--------------------
DC system

Lab-name: WCP_80-91

Branch/Pull Time/Commit
-----------------------
2020-04-29_20-00-00

Last Pass
---------
2020-03-29_16-39-59

Timestamp/Logs
--------------
[sysadmin@controller-0 ~(keystone_admin)]$ system application-list
+--------------------------+---------+-----------------------------------+--------------------------------------+----------+-----------+
| application | version | manifest name | manifest file | status | progress |
+--------------------------+---------+-----------------------------------+--------------------------------------+----------+-----------+
| cert-manager | 1.0-0 | cert-manager-manifest | certmanager-manifest.yaml | applied | completed |
| nginx-ingress-controller | 1.0-0 | nginx-ingress-controller-manifest | nginx_ingress_controller_manifest. | applied | completed |
| | | | yaml | | |
| | | | | | |
| oidc-auth-apps | 1.0-0 | oidc-auth-manifest | manifest.yaml | uploaded | completed |
| platform-integ-apps | 1.0-8 | platform-integration-manifest | manifest.yaml | applied | completed |
| stx-monitor | 1.0-1 | analytics-armada-manifest | wr-analytics.yaml | applied | completed |
+--------------------------+---------+-----------------------------------+--------------------------------------+----------+-----------+
[sysadmin@controller-0 ~(keystone_admin)]$ fm alarm-list
+----------+----------------------------------------------------------------------------------+-------------------+----------+----------------+
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+----------+----------------------------------------------------------------------------------+-------------------+----------+----------------+
| 400.003 | Evaluation license key will expire on 30-sep-2020; there are 152 days remaining | host=controller-1 | minor | 2020-05-01T16: |
| | in this evaluation | | | 43:53.094570 |
| | | | | |
| 400.003 | Evaluation license key will expire on 30-sep-2020; there are 152 days remaining | host=controller-0 | minor | 2020-05-01T16: |
| | in this evaluation | | | 43:49.649907 |
| | | | | |
| 500.101 | Developer patch certificate is enabled | host=controller | critical | 2020-05-01T00: |
| | | | | 06:02.038877 |
| | | | | |
+----------+----------------------------------------------------------------------------------+-------------------+----------+----------------+
[sysadmin@controller-0 ~(keystone_admin)]$ dcmanager subcloud list
+----+-----------+------------+--------------+---------------+---------+
| id | name | management | availability | deploy status | sync |
+----+-----------+------------+--------------+---------------+---------+
| 2 | subcloud6 | managed | online | complete | in-sync |
| 4 | subcloud4 | managed | online | complete | in-sync |
| 7 | subcloud7 | managed | online | complete | in-sync |
+----+-----------+------------+--------------+---------------+---------+

[sysadmin@controller-0 ~(keystone_admin)]$ sw-patch --os-region-name SystemController upload 2020-04-29_20-00-00_LARGE.patch
2020-04-29_20-00-00_LARGE is now available

[sysadmin@controller-0 ~(keystone_admin)]$ sw-patch --os-region-name SystemController apply 2020-04-29_20-00-00_LARGE
2020-04-29_20-00-00_LARGE is now in the repo

[sysadmin@controller-0 ~(keystone_admin)]$ dcmanager patch-strategy create --subcloud-apply-type parallel --max-parallel-subclouds 10
+------------------------+----------------------------+
| Field | Value |
+------------------------+----------------------------+
| subcloud apply type | parallel |
| max parallel subclouds | 10 |
| stop on failure | False |
| state | initial |
| created_at | 2020-05-02T14:09:14.421615 |
| updated_at | None |
+------------------------+----------------------------+

[sysadmin@controller-0 ~(keystone_admin)]$ dcmanager patch-strategy apply
+------------------------+----------------------------+
| Field | Value |
+------------------------+----------------------------+
| subcloud apply type | parallel |
| max parallel subclouds | 10 |
| stop on failure | False |
| state | applying |
| created_at | 2020-05-02T14:09:14.421615 |
| updated_at | 2020-05-02T14:10:19.170945 |
+------------------------+----------------------------+
[sysadmin@controller-0 ~(keystone_admin)]$ dcmanager strategy-step list
+------------------+-------+-------------------+---------+----------------------------+-------------+
| cloud | stage | state | details | started_at | finished_at |
+------------------+-------+-------------------+---------+----------------------------+-------------+
| SystemController | 1 | creating strategy | | 2020-05-02 14:10:28.837943 | None |
| subcloud6 | 2 | initial | | None | None |
| subcloud4 | 2 | initial | | None | None |
| subcloud7 | 2 | initial | | None | None |
+------------------+-------+-------------------+---------+----------------------------+-------------+

[sysadmin@controller-0 ~(keystone_admin)]$ dcmanager strategy-step list
+------------------+-------+----------+---------------------------------------------------------------------+----------------------------+----------------------------+
| cloud | stage | state | details | started_at | finished_at |
+------------------+-------+----------+---------------------------------------------------------------------+----------------------------+----------------------------+
| SystemController | 1 | complete | | 2020-05-02 14:10:28.837943 | 2020-05-02 14:55:48.468191 |
| subcloud6 | 2 | failed | Strategy apply failed for subcloud6 - unexpected state abort-failed | 2020-05-02 14:55:58.477515 | 2020-05-02 15:30:22.119729 |
| subcloud4 | 2 | complete | | 2020-05-02 14:55:58.484945 | 2020-05-02 15:40:17.458608 |
| subcloud7 | 2 | complete | | 2020-05-02 14:55:58.495617 | 2020-05-02 15:25:26.728835 |
+------------------+-------+----------+---------------------------------------------------------------------+----------------------------+----------------------------+
[sysadmin@controller-0 ~(keystone_admin)]$
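
When many subclouds are patched in parallel, it can help to pull the failed steps out of the `dcmanager strategy-step list` output automatically. The following is only a sketch: it assumes the table text has been captured as a string, and the helper names `parse_table` and `failed_steps` are hypothetical, not part of dcmanager. The sample rows are copied from the output above.

```python
# Sketch: find failed steps in captured "dcmanager strategy-step list" output.
# parse_table and failed_steps are illustrative helpers, not dcmanager APIs.

SAMPLE = """\
+------------------+-------+----------+---------+------------+-------------+
| cloud | stage | state | details | started_at | finished_at |
+------------------+-------+----------+---------+------------+-------------+
| SystemController | 1 | complete | | 2020-05-02 14:10:28.837943 | 2020-05-02 14:55:48.468191 |
| subcloud6 | 2 | failed | Strategy apply failed for subcloud6 - unexpected state abort-failed | 2020-05-02 14:55:58.477515 | 2020-05-02 15:30:22.119729 |
| subcloud4 | 2 | complete | | 2020-05-02 14:55:58.484945 | 2020-05-02 15:40:17.458608 |
| subcloud7 | 2 | complete | | 2020-05-02 14:55:58.495617 | 2020-05-02 15:25:26.728835 |
+------------------+-------+----------+---------+------------+-------------+
"""


def parse_table(text):
    """Turn a '|'-delimited CLI table into a list of dicts keyed by header."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not (line.startswith("|") and line.endswith("|")):
            continue  # skip borders (+---+), prompts and blank lines
        cells = [cell.strip() for cell in line[1:-1].split("|")]
        if header is None:
            header = cells  # first '|' row is the header
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def failed_steps(text):
    """Return (cloud, details) for every step whose state is 'failed'."""
    return [(r["cloud"], r["details"])
            for r in parse_table(text) if r["state"] == "failed"]


for cloud, details in failed_steps(SAMPLE):
    print(f"{cloud}: {details}")
```

For the sample above this reports only subcloud6, matching the failure described in this bug.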

Subcloud6:
[sysadmin@controller-0 ~(keystone_admin)]$ fm alarm-list
+----------+----------------------------------------------------------------------------------+-----------------------+----------+----------------+
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+----------+----------------------------------------------------------------------------------+-----------------------+----------+----------------+
| 400.001 | Service group controller-services failure; dnsmasq(enabled-active, failed) | service_domain= | critical | 2020-05-02T17: |
| | | controller. | | 43:38.264746 |
| | | service_group= | | |
| | | controller-services. | | |
| | | host=controller-0 | | |
| | | | | |
| 400.002 | Service group controller-services has no active members available; expected 1 | service_domain= | critical | 2020-05-02T15: |
| | active member | controller. | | 03:27.053786 |
| | | service_group= | | |
| | | controller-services | | |
| | | | | |
| 200.001 | controller-0 was administratively locked to take it out-of-service. | host=controller-0 | warning | 2020-05-02T14: |
| | | | | 59:09.530587 |
| | | | | |
| 400.003 | Evaluation license key will expire on 30-sep-2020; there are 151 days remaining | host=controller-0 | minor | 2020-05-02T00: |
| | in this evaluation | | | 59:13.512835 |
| | | | | |
| 500.101 | Developer patch certificate is enabled | host=controller | critical | 2020-05-01T00: |
| | | | | 11:40.719420 |
| | | | | |
+----------+----------------------------------------------------------------------------------+-----------------------+----------+----------------+
[sysadmin@controller-0 ~(keystone_admin)]$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | locked | disabled | online |

Test Activity
-------------
Regression Testing

Peng Peng (ppeng) wrote :
Peng Peng (ppeng)
tags: added: stx.retestneeded
Ghada Khalil (gkhalil)
tags: added: stx.up
tags: added: stx.4.0 stx.distcloud stx.update
removed: stx.up
Changed in starlingx:
status: New → Triaged
importance: Undecided → Medium
assignee: nobody → Bart Wensley (bartwensley)
Bart Wensley (bartwensley) wrote :

The strategy apply for subcloud6 failed because the patch application timed out after 30 minutes:

2020-05-02T14:59:44.942 controller-0 VIM_Thread[88472] INFO _strategy_steps.py.562 Step (sw-patch-hosts) apply for hosts [u'controller-0'].
2020-05-02T15:29:45.219 controller-0 VIM_Thread[88472] INFO _strategy_stage.py.427 Stage (sw-patch-worker-hosts) step (sw-patch-hosts) timed out, timeout_in_secs=1800.

The patching logs indicate that the LARGE patch was being applied, which updates all the packages on the system:

2020-05-02T14:57:37: sw-patch-controller-daemon[10975]: patch_controller.py(1123): INFO: Applying patch: 2020-04-29_20-00-00_LARGE
2020-05-02T14:59:44: sw-patch-controller-daemon[10975]: patch_controller.py(2107): INFO: Running host-install for controller-0 (fd01:15::3), force=False, async_req=True
2020-05-02T15:33:18: sw-patch-agent[10981]: patch_agent.py(545): INFO: Transaction complete: undo_failure=True, success=True
2020-05-02T15:33:18: sw-patch-agent[10981]: patch_agent.py(655): INFO: Reboot is required. Skipping patch-scripts

It took more than 30 minutes for this patch to be installed on controller-0, so the VIM timed out.
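
As a quick check of the timing, the elapsed install time can be computed from the two timestamps in the logs above and compared with the VIM step timeout. This is a minimal sketch: it assumes the host-install request (14:59:44) marks the start and the "Transaction complete" line (15:33:18) marks the end, and the variable names are illustrative only.

```python
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%S"

# Timestamps copied from the sw-patch logs above.
install_requested = datetime.strptime("2020-05-02T14:59:44", FMT)
transaction_done = datetime.strptime("2020-05-02T15:33:18", FMT)

# timeout_in_secs reported by the VIM for the sw-patch-hosts step.
vim_timeout_secs = 1800

elapsed_secs = int((transaction_done - install_requested).total_seconds())
overrun_secs = elapsed_secs - vim_timeout_secs

minutes, seconds = divmod(elapsed_secs, 60)
print(f"install took {minutes} min {seconds} s ({elapsed_secs} s)")
print(f"exceeded the {vim_timeout_secs} s timeout by {overrun_secs} s")
```

Under those assumptions the install ran for 33 min 34 s, about three and a half minutes past the 1800-second limit.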

I will assign this to Don to comment on the amount of time it takes to install this patch. If it is expected to take 34 minutes to install, then I would suggest that this patch should not be used for testing patch orchestration. No user would apply a patch this large in the real world so I would hesitate to increase our timeouts.

Changed in starlingx:
assignee: Bart Wensley (bartwensley) → Don Penney (dpenney)
Bart Wensley (bartwensley) wrote :

Since the analysis indicates this issue has nothing to do with distributed cloud, I have removed the stx.distcloud tag.

tags: removed: stx.distcloud
Don Penney (dpenney) wrote :

The kmod RPMs are taking roughly a minute and a half each to install, due to the scriptlets executed in each one. The cost appears to come from "weak-modules", which every kmod RPM runs and which takes a long time to complete.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to kernel (master)

Fix proposed to branch: master
Review: https://review.opendev.org/737079

Changed in starlingx:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to kernel (master)

Reviewed: https://review.opendev.org/737079
Committed: https://git.openstack.org/cgit/starlingx/kernel/commit/?id=6764889c7360803549a0e4762198d7b84f2d31f8
Submitter: Zuul
Branch: master

commit 6764889c7360803549a0e4762198d7b84f2d31f8
Author: Don Penney <email address hidden>
Date: Fri Jun 19 16:33:52 2020 -0400

    Drop weak-modules call from kmod RPM scripts

    The kmod RPM scriptlets are calling weak-modules from the postinstall
    and postuninstall scriptlets. Each call is taking approximately 40
    seconds. On an AIO install, for example, this adds about 5 minutes to
    the install. For a kernel software update, since both postinstall and
    postuninstall scriptlets are called, this can add about 10 minutes to
    the update installation.

    As weak modules provide little benefit in a closed system, these calls
    have been removed in this update.

    Change-Id: I7ef577667bef1e75a0aa8542c76d401ecd5c896a
    Closes-Bug: 1876500
    Signed-off-by: Don Penney <email address hidden>
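
The figures in the commit message can be cross-checked with some rough arithmetic. This is only a sketch under stated assumptions: each kmod RPM makes one weak-modules call per scriptlet, an install runs only postinstall, and an update runs both postinstall and postuninstall. The constant names are illustrative, and the number of kmod RPMs is inferred rather than taken from the report.

```python
# Approximate figures quoted in the commit message.
SECONDS_PER_CALL = 40               # one weak-modules invocation
AIO_INSTALL_OVERHEAD_SECS = 5 * 60  # extra time on an AIO install

# Install: one call (postinstall) per kmod RPM, so the quoted overhead
# implies roughly this many kmod RPMs.
implied_kmod_rpms = AIO_INSTALL_OVERHEAD_SECS / SECONDS_PER_CALL

# Update: two calls (postinstall + postuninstall) per kmod RPM.
update_secs_per_rpm = 2 * SECONDS_PER_CALL
update_overhead_secs = implied_kmod_rpms * update_secs_per_rpm

print(f"implied kmod RPMs: about {implied_kmod_rpms:.1f}")
print(f"update cost per kmod RPM: {update_secs_per_rpm} s")
print(f"total update overhead: {update_overhead_secs / 60:.0f} min")
```

Two calls of about 40 seconds each give roughly 80 seconds per kmod RPM during an update, which is in line with the "minute and a half each" observed earlier, and the total matches the "about 10 minutes" quoted for a kernel software update.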

Changed in starlingx:
status: In Progress → Fix Released
Peng Peng (ppeng)
description: updated
Peng Peng (ppeng) wrote :

Verified on
[sysadmin@controller-1 ~(keystone_admin)]$ cat /etc/build.info
###
### Wind River Cloud Platform
### Release 20.06
###
### Wind River Systems, Inc.
###

SW_VERSION="20.06"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="2020-06-24_22-16-59"
SRC_BUILD_ID="34"

JOB="WRCP_20.06_Build"
BUILD_BY="jenkins"
BUILD_NUMBER="34"
BUILD_HOST="yow-cgts4-lx.wrs.com"
BUILD_DATE="2020-06-24 22:19:21 -0400"
[sysadmin@controller-1 ~(keystone_admin)]$ sw-patch query
             Patch ID RR Release Patch State
================================== == ======= ===========
2020-06-24_22-16-59_RR_ALLNODES Y 20.06 Applied
DCMANAGERCLIENT_20.06 N 20.06 Applied
PATCH.ENABLE_DEV_CERTIFICATE-20.06 N 20.06 Applied

tags: removed: stx.retestneeded