Subcloud silently failed to install patch and reboot
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
StarlingX | Fix Released | Medium | Don Penney |
Bug Description
Brief Description
-----------------
While patching 50 subclouds in parallel, one of them failed to install the patch and reboot, even though it subsequently reported itself as patch current.
Severity
--------
Major
Steps to Reproduce
------------------
1. Upload the patch (sw-patch --os-region-name SystemController upload)
2. Apply the patch (sw-patch --os-region-name SystemController apply)
3. Create the patch orchestration strategy (dcmanager patch-strategy create --max-parallel-
4. Apply the strategy (dcmanager patch-strategy apply)
Expected Behavior
------------------
All subclouds to install the patch and reboot.
Actual Behavior
----------------
One subcloud out of the 50 failed to install the patch and reboot.
Reproducibility
---------------
Reproducible (seen twice, on subcloud54 and subcloud90)
System Configuration
--------------------
DC system controllers: Two-node system with worker
Subclouds: One-node system
IPv6
Branch/Pull Time/Commit
-----------------------
2020-06-11_20-00-00
Last Pass
---------
N/A; this was the first run patching 50 subclouds in parallel.
Timestamp/Logs
--------------
Timing:
Thu Jun 18 01:50:45 UTC 2020 dcmanager patch-strategy create --max-parallel-
Thu Jun 18 01:51:42 UTC 2020 dcmanager patch-strategy apply
Output of “dcmanager strategy-step list” for subcloud54 and subcloud90:

cloud      | stage | state    | details | started_at                 | finished_at
-----------|-------|----------|---------|----------------------------|----------------------------
subcloud54 | 2     | complete |         | 2020-06-18 02:22:21.799734 | 2020-06-18 02:23:11.011785
subcloud90 | 3     | complete |         | 2020-06-18 02:36:46.112333 | 2020-06-18 02:37:17.404720
subcloud54:
controller-0:~$ sudo sw-patch query
Password:
Patch ID                          RR   Release   Patch State
================================  ===  ========  ============
2020-06-11_20-00-00_RR_ALLNODES   Y    20.06     Applied
Here is the log snippet from subcloud54; some of these entries differed from those on a subcloud that patched successfully:
Spawning worker 0 with 1 pkgs
Spawning worker 1 with 1 pkgs
Spawning worker 2 with 1 pkgs
Spawning worker 3 with 0 pkgs
Workers Finished
Saving Primary metadata
Saving file lists metadata
Saving other metadata
Generating sqlite DBs
Sqlite DBs complete
Test Activity
-------------
System Test
Workaround
----------
Reboot failed subclouds
Changed in starlingx:
assignee: nobody → Don Penney (dpenney)
tags: added: stx.update
controller-0:~# sw-patch query-hosts
Hostname IP Address Patch Current Reboot Required Release State
============ =========== ============= =============== ======= =====
controller-0 fd01:136::3 Yes No 20.06 idle
[root@controller-0 ~(keystone_admin)]# sw-patch show 2020-06-11_20-00-00_RR_ALLNODES
2020-06-11_20-00-00_RR_ALLNODES:
    Release: 20.06
    Patch State: Applied
    RR: Y
    Summary: Patch to /etc/init.d/logmgmt
    Contents:
        RR_ALLNODES-1.0-1.tis.x86_64.rpm
        logmgmt-1.0-6.tis.x86_64.rpm
        logmgmt-wheels-1.0-6.tis.x86_64.rpm
[root@controller-0 ~(keystone_admin)]# rpm -qa | grep logmgmt
logmgmt-1.0-4.tis.x86_64
Don's Analysis:
It’s very odd…. the timestamps are all pretty close, but the patch-controller applies the patch, updating the repo… then the patch-agent queries the repo… but seems to have gotten stale metadata somehow… I think this is the first time I’ve wished I had sub-second precision logging in patching.log
Sequence on patch-apply:
• controller:
    o copy the RPM to the updates repo
    o run createrepo to update the repo metadata (repomd.xml modified at 02:22:22.449000000)
    o increment patch_op_counter as part of the “sync” request, which is included in HELLO messages (.controller.state file modified at 02:22:22.466000000)
• agent:
    o receives HELLO message
    o if patch_op_counter is greater than the cached counter value (sketched in code below):
        - run “dnf makecache” to ensure local dnf data is updated
        - perform a new query
        - send the results back to the patch-controller
In patching.log, I see the repo update and the “sync” request, then I see the patch-agent logs showing it does a query. I see the repomd.xml query in /www/var/log/lighttpd-access.log (two, actually, right together… the first presumably from the makecache, and the second from the new query).
HOWEVER…. see these logs:
fd01:136::2 controller:8080 - [17/Jun/2020:23:13:34 +0000] "GET /updates/rel-20.06/repodata/repomd.xml HTTP/1.1" 200 3661 "-" "libdnf"
fd01:136::2 controller:8080 - [17/Jun/2020:23:17:20 +0000] "GET /updates/rel-20.06/repodata/repomd.xml HTTP/1.1" 200 3661 "-" "libdnf"
fd01:136::2 controller:8080 - [18/Jun/2020:02:17:23 +0000] "GET /updates/rel-20.06/repodata/repomd.xml HTTP/1.1" 200 3661 "-" "libdnf"
fd01:136::2 controller:8080 - [18/Jun/2020:02:22:22 +0000] "GET /updates/rel-20.06/repodata/repomd.xml HTTP/1.1" 200 3661 "-" "libdnf"
fd01:136::2 controller:8080 - [18/Jun/2020:02:22:22 +0000] "GET /updates/rel-20.06/repodata/repomd.xml HTTP/1.1" 200 3661 "-" "libdnf"
fd01:136::2 controller:8080 - [18/Jun/2020:05:17:24 +0000] "GET /updates/rel-20.06/repodata/repomd.xml HTTP/1.1" 200 3665 "-" "libdnf"
fd01:136::2 controller:8080 - [18/Jun/2020:05:17:24 +0000] "GET /updates/rel-20.06/repodata/repomd.xml HTTP/1.1" 200 3665 "-" "libdnf"
The file size being returned (the field after the 200 status) shows that the repomd.xml handed to the patch-agent when it did the query was in fact the stale one: 3661 bytes, while the updated file on disk is 3665 bytes:

controller-0:/home/sysadmin# stat /www/pages/updates/rel-20.06/repodata/repomd.xml
  File: ‘/www/pages/updates/rel-20.06/repodata/repomd.xml’
  Size: 3665      Blocks: 8 ...