Subcloud silently failed to install patch and reboot

Bug #1884094 reported by Nimalini Rasa
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
StarlingX
Fix Released
Medium
Don Penney

Bug Description

Brief Description
-----------------
While trying to patch 50 subclouds in parallel 1 of them failed to install the patch and reboot, although it reports as patch current.

Severity
--------
Major

Steps to Reproduce
------------------
Upload a patch (sw-patch --os-region-name SystemController upload)
apply the patch (sw-patch --os-region-name SystemController apply)
Try to create patch orchestration strategy (dcmanager patch-strategy create --max-parallel-subclouds 50)
Apply the strategy (dcmanager patch-strategy apply )

Expected Behavior
------------------
All subclouds to install the patch and reboot.

Actual Behavior
----------------
1 subcloud failed out of the 50 subclouds

Reproducibility
---------------
Reproducible (seen 2 times) (subcloud54 and subcloud90)

System Configuration
--------------------
DC system controllers:Two node system with worker
Subclouds: One node system
IPv6

Branch/Pull Time/Commit
-----------------------
2020-06-11_20-00-00

Last Pass
---------
First time doing 50 subclouds in parallel

Timestamp/Logs
--------------
Timing:
Thu Jun 18 01:50:45 UTC 2020  dcmanager patch-strategy create --max-parallel-subclouds 50
Thu Jun 18 01:51:42 UTC 2020  dcmanager patch-strategy apply

Output of “dcmanager strategy-step list”  for Subcloud54 and subcloud90:
subcloud54       |     2 | complete |         | 2020-06-18 02:22:21.799734 | 2020-06-18 02:23:11.011785
subcloud90       |     3 | complete |         | 2020-06-18 02:36:46.112333 | 2020-06-18 02:37:17.404720
subcloud54:
controller-0:~$ sudo sw-patch query
Password:
           Patch ID              RR  Release  Patch State
===============================  ==  =======  ===========
2020-06-11_20-00-00_RR_ALLNODES  Y    20.06     Applied

Here is the log snap from subcloud54 (highlighted in Red looks diff on the subcloud that worked):
2020-06-18T02:22:21: sw-patch-controller-daemon[76960]: patch_controller.py(947): INFO: Importing patches: /scratch/2020-06-11_20-00-00_RR_ALLNODES.patch
2020-06-18T02:22:21: sw-patch-controller-daemon[76960]: patch_controller.py(952): INFO: Importing patch: /scratch/2020-06-11_20-00-00_RR_ALLNODES.patch
2020-06-18T02:22:21: sw-patch-controller-daemon[76960]: patch_functions.py(998): INFO: Signature verified, patch has been signed
2020-06-18T02:22:21: sw-patch-controller-daemon[76960]: patch_controller.py(293): INFO: sending sync req
2020-06-18T02:22:21: sw-patch-controller-daemon[76960]: patch_controller.py(1733): INFO: Sync complete
2020-06-18T02:22:21: sw-patch-controller-daemon[76960]: patch_controller.py(1048): INFO: Applying patches: 2020-06-11_20-00-00_RR_ALLNODES
2020-06-18T02:22:21: sw-patch-controller-daemon[76960]: patch_controller.py(1123): INFO: Applying patch: 2020-06-11_20-00-00_RR_ALLNODES
2020-06-18T02:22:21: sw-patch-controller-daemon[76960]: patch_controller.py(1161): INFO: Copy /opt/patching/packages/20.06/logmgmt-wheels-1.0-6.tis.x86_64.rpm to /www/pages/updates/rel-20.06/Packages/x86_64/logmgmt-wheels-1.0-6.tis.x86_64.rpm
2020-06-18T02:22:21: sw-patch-controller-daemon[76960]: patch_controller.py(1161): INFO: Copy /opt/patching/packages/20.06/RR_ALLNODES-1.0-1.tis.x86_64.rpm to /www/pages/updates/rel-20.06/Packages/x86_64/RR_ALLNODES-1.0-1.tis.x86_64.rpm
2020-06-18T02:22:21: sw-patch-controller-daemon[76960]: patch_controller.py(1161): INFO: Copy /opt/patching/packages/20.06/logmgmt-1.0-6.tis.x86_64.rpm to /www/pages/updates/rel-20.06/Packages/x86_64/logmgmt-1.0-6.tis.x86_64.rpm
2020-06-18T02:22:22: sw-patch-controller-daemon[76960]: patch_controller.py(1209): INFO: Repo[20.06] updated:
Spawning worker 0 with 1 pkgs
Spawning worker 1 with 1 pkgs
Spawning worker 2 with 1 pkgs
Spawning worker 3 with 0 pkgs
Workers Finished
Saving Primary metadata
Saving file lists metadata
Saving other metadata
Generating sqlite DBs
Sqlite DBs complete

2020-06-18T02:22:22: sw-patch-controller-daemon[76960]: patch_controller.py(293): INFO: sending sync req
2020-06-18T02:22:22: sw-patch-controller-daemon[76960]: patch_controller.py(1733): INFO: Sync complete
2020-06-18T02:22:22: sw-patch-agent[76918]: patch_agent.py(511): INFO: Patch state query returns False
2020-06-18T02:22:22: sw-patch-agent[76918]: patch_agent.py(512): INFO: Installed: {}
2020-06-18T02:22:22: sw-patch-agent[76918]: patch_agent.py(513): INFO: To install: {}
2020-06-18T02:22:22: sw-patch-agent[76918]: patch_agent.py(514): INFO: To remove: []
2020-06-18T02:22:22: sw-patch-agent[76918]: patch_agent.py(515): INFO: Missing: []
2020-06-18T02:22:23: sw-patch-agent[76918]: patch_agent.py(511): INFO: Patch state query returns False
2020-06-18T02:22:23: sw-patch-agent[76918]: patch_agent.py(512): INFO: Installed: {}
2020-06-18T02:22:23: sw-patch-agent[76918]: patch_agent.py(513): INFO: To install: {}
2020-06-18T02:22:23: sw-patch-agent[76918]: patch_agent.py(514): INFO: To remove: []
2020-06-18T02:22:23: sw-patch-agent[76918]: patch_agent.py(515): INFO: Missing: []
2020-06-18T02:22:23: sw-patch-agent[76918]: patch_agent.py(189): INFO: Handling detailed query
2020-06-18T02:22:23: sw-patch-agent[76918]: patch_agent.py(189): INFO: Handling detailed query

Test Activity
-------------
System Test

Workaround
----------
Reboot failed subclouds

Revision history for this message
Nimalini Rasa (nrasa) wrote :
Download full text (3.9 KiB)

controller-0:~# sw-patch query-hosts
  Hostname IP Address Patch Current Reboot Required Release State
============ =========== ============= =============== ======= =====
controller-0 fd01:136::3 Yes No 20.06 idle

[root@controller-0 ~(keystone_admin)]# sw-patch show 2020-06-11_20-00-00_RR_ALLNODES
2020-06-11_20-00-00_RR_ALLNODES:
    Release: 20.06
    Patch State: Applied
    RR: Y
    Summary: Patch to /etc/init.d/logmgmt
    Contents:
                    RR_ALLNODES-1.0-1.tis.x86_64.rpm
                    logmgmt-1.0-6.tis.x86_64.rpm
                    logmgmt-wheels-1.0-6.tis.x86_64.rpm

[root@controller-0 ~(keystone_admin)]# rpm -qa | grep logmgmt
logmgmt-1.0-4.tis.x86_64

Don's Analysis:

It’s very odd…. the timestamps are all pretty close, but the patch-controller applies the patch, updating the repo… then the patch-agent queries the repo… but seems to have gotten stale metadata somehow… I think this is the first time I’ve wished I had sub-second precision logging in patching.log 

Sequence on patch-apply:
• controller:
o copy rpm to updates repo
o run createrepo to update repo metadata (repomd.xml modified at 02:22:22.449000000)
o increment patch_op_counter as part of “sync” request, which is included in HELLO messages (.controller.state file modified at 02:22:22.466000000)
• agent:
o receives HELLO message
o if patch_op_counter is greater than cached counter value:
 run “dnf makecache” to ensure local dnf data is updated
 perform new query
 send results back to patch-controller

In patching.log, I see the repo update and the “sync” request, then I see the patch agent logs showing it does a query. I see the repomd.xml query in /www/var/log/lighttpd-access.log (two, actually, right together… the first presumably the makecache, and the second the new query).

HOWEVER…. see these logs:

fd01:136::2 controller:8080 - [17/Jun/2020:23:13:34 +0000] "GET /updates/rel-20.06/repodata/repomd.xml HTTP/1.1" 200 3661 "-" "libdnf"
fd01:136::2 controller:8080 - [17/Jun/2020:23:17:20 +0000] "GET /updates/rel-20.06/repodata/repomd.xml HTTP/1.1" 200 3661 "-" "libdnf"
fd01:136::2 controller:8080 - [18/Jun/2020:02:17:23 +0000] "GET /updates/rel-20.06/repodata/repomd.xml HTTP/1.1" 200 3661 "-" "libdnf"
fd01:136::2 controller:8080 - [18/Jun/2020:02:22:22 +0000] "GET /updates/rel-20.06/repodata/repomd.xml HTTP/1.1" 200 3661 "-" "libdnf"
fd01:136::2 controller:8080 - [18/Jun/2020:02:22:22 +0000] "GET /updates/rel-20.06/repodata/repomd.xml HTTP/1.1" 200 3661 "-" "libdnf"
fd01:136::2 controller:8080 - [18/Jun/2020:05:17:24 +0000] "GET /updates/rel-20.06/repodata/repomd.xml HTTP/1.1" 200 3665 "-" "libdnf"
fd01:136::2 controller:8080 - [18/Jun/2020:05:17:24 +0000] "GET /updates/rel-20.06/repodata/repomd.xml HTTP/1.1" 200 3665 "-" "libdnf"

That highlighted field is the file size being returned… which shows that the repomd.xml returned to the patch-agent when it did the query was in fact the stale one…
controller-0:/home/sysadmin# stat /www/pages/updates/rel-20.06/repodata/repomd.xml
  File: ‘/www/pages/updates/rel-20.06/repodata/repomd.xml’
  Size: 3665 Blocks: 8 ...

Read more...

Revision history for this message
Nimalini Rasa (nrasa) wrote :
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Don Penney (dpenney)
tags: added: stx.update
Revision history for this message
Ghada Khalil (gkhalil) wrote :

stx.5.0 - requires further investigation, but given the impact was limited to 1 subcloud out of 50 and there is a workaround to recover, this will not hold up stx.4.0

Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.5.0
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to update (master)

Fix proposed to branch: master
Review: https://review.opendev.org/737403

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to update (master)

Reviewed: https://review.opendev.org/737403
Committed: https://git.openstack.org/cgit/starlingx/update/commit/?id=d3f8af4f4c6df5eede6f83d67bff8970d4bd1ed7
Submitter: Zuul
Branch: master

commit d3f8af4f4c6df5eede6f83d67bff8970d4bd1ed7
Author: Don Penney <email address hidden>
Date: Mon Jun 22 21:16:35 2020 -0400

    Protect against stale repo data in DNF cache

    When the patch-agent is notified by the patch-controller of a new
    patching operation (patch_op_counter in HELLO is incremented), it will
    run a new software query consisting first of running "dnf makecache",
    then checking the software repositories for changes. In rare cases,
    the metadata returned to dnf when it makes the query could be stale,
    resulting in the patch-agent believing the current software is
    up-to-date.

    In order to protect against this, the patch-agent will now verify the
    repository revision id for the updates repo. If the revision id has
    not changed when it is reasonable to expect it could have (ie. the
    patch_op_counter has increased, indicating the repository may have
    been updated), the patch-agent will retry once to allow for any
    potential caching to have cleared.

    Change-Id: I3a44ed86e16cd9fe67f0b0e763c95a5a7e126cf8
    Closes-Bug: 1884094
    Signed-off-by: Don Penney <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Changing the release tag to stx.4.0 since the fix was merged in stx master for that release.

tags: added: stx.4.0
removed: stx.5.0
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to update (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/update/+/792216

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on update (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/update/+/792216

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to update (f/centos8)
Download full text (10.3 KiB)

Reviewed: https://review.opendev.org/c/starlingx/update/+/793629
Committed: https://opendev.org/starlingx/update/commit/e7f342378bcfe05e81c53f6d15e44a61e89f6caf
Submitter: "Zuul (22348)"
Branch: f/centos8

commit 49e39fa949850a7d6fc6083539a4cd3390e6172b
Author: Charles Short <email address hidden>
Date: Fri May 28 08:55:08 2021 -0400

    Specify the nodeset zuul jobs

    The py2.7 jobs need to specify xenial
    The py3.6 jobs need to specify bionic
    The focal zuul nodes only have python 3.8 installed in them

    Zuul targets that invoke a generic python3 interpreter such
    as pep8 is not specified.

    Also ignore H216 since we still use py2.7.

    The copyright date was updated in order to trigger
    the zuul jobs, as a no-delta type of change.

    Partial-Bug: 1928978

    Signed-off-by: Charles Short <email address hidden>
    Change-Id: I81fd790dfc8a665a4e4e0ff59a013af7921b6e06
    Signed-off-by: Charles Short <email address hidden>

commit 6a019e60c3bd0943904589e47e32c732e48a8086
Author: Chris Friesen <email address hidden>
Date: Thu Mar 4 18:59:53 2021 -0500

    add dcmanager-audit-worker to sample restart script

    We've added a new distributed-cloud process, so update the sample script.

    Story: 2007267
    Task: 42000
    Signed-off-by: Chris Friesen <email address hidden>
    Change-Id: I16522744c9526e134577bda80cf91a8308eee889

commit 9a08e51d41a7b1b5d4c25c2a7775c70b57d278d9
Author: Charles Short <email address hidden>
Date: Sun Jan 24 08:36:23 2021 -0500

    Cap bandit to v1.6.2

    Cap bandit to v1.6.2 so we do not pull in a python3 only version.
    Without this fix the unit tests will fail to setup and run properly.

    Closes-Bug: 1916494

    Signed-off-by: Charles Short <email address hidden>
    Change-Id: I7afbc8d224f28146af42f43593586cd680a52aaf

commit 764a0576011c4a8cb43b2bd3ee1fd890bad6058a
Author: Don Penney <email address hidden>
Date: Mon Dec 21 15:22:29 2020 -0500

    Fix exclusion paths in cgcs-patch build_srpm.data

    The EXCLUDE_LIST_FROM_TAR list in cgcs-patch build_srpm.data includes
    entries to exclude .tox and other dirs when building the package, but
    the path included an extra directory level. This update corrects the
    paths.

    Change-Id: I8b3641e4e86f52ef7b9fb56f9eb3df289935e188
    Closes-Bug: 1908940
    Signed-off-by: Don Penney <email address hidden>

commit adaaba0c21e78a2a24c459e34ea7c9963024e3f6
Author: Don Penney <email address hidden>
Date: Thu Dec 17 13:17:58 2020 -0500

    Add auto-version for remaining stx/update packages

    Update remaining StarlingX packages with hardcoded TIS_PATCH_VER to
    use PKG_GITREVCOUNT where possible, with offsets as needed to ensure
    the version is incremented above the hardcoded version.

    Change-Id: I877a6c0802a707863a3b6c7d430146431e10a3ad
    Story: 2008455
    Task: 41459
    Signed-off-by: Don Penney <email address hidden>

commit 62a66370cac70a62f827fedfc8b76284fde006d3
Author: Don Penney <email address hidden>
Date: Thu Nov 19 10:41:27 2020 -0500

    Add protection a...

tags: added: in-f-centos8
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.