controller-0:~# sw-patch query-hosts
Hostname      IP Address   Patch Current  Reboot Required  Release  State
============  ===========  =============  ===============  =======  =====
controller-0  fd01:136::3  Yes            No               20.06    idle
[root@controller-0 ~(keystone_admin)]# sw-patch show 2020-06-11_20-00-00_RR_ALLNODES
2020-06-11_20-00-00_RR_ALLNODES:
    Release: 20.06
    Patch State: Applied
    RR: Y
    Summary: Patch to /etc/init.d/logmgmt
    Contents:
        RR_ALLNODES-1.0-1.tis.x86_64.rpm
        logmgmt-1.0-6.tis.x86_64.rpm
        logmgmt-wheels-1.0-6.tis.x86_64.rpm

[root@controller-0 ~(keystone_admin)]# rpm -qa | grep logmgmt
logmgmt-1.0-4.tis.x86_64
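For context on how "Patch Current: Yes" can coexist with the old logmgmt-1.0-4 still installed: the agent judges currency by comparing what is installed against the repo metadata, so stale metadata that does not yet advertise logmgmt-1.0-6 makes the host look current. A toy illustration of that failure mode (not the actual sw-patch code; the names and the string-based version compare are simplifications of mine):

```python
def is_patch_current(installed, repo):
    """Toy model: a host is 'patch current' when the repo offers no
    newer version of any installed package. Plain string comparison
    happens to order these particular version strings correctly."""
    return all(repo.get(name, ver) <= ver for name, ver in installed.items())

installed  = {"logmgmt": "1.0-4"}
stale_repo = {"logmgmt": "1.0-4"}   # repomd.xml before createrepo ran
fresh_repo = {"logmgmt": "1.0-6"}   # repomd.xml after createrepo ran

print(is_patch_current(installed, stale_repo))  # True  -> reported as "Patch Current: Yes"
print(is_patch_current(installed, fresh_repo))  # False -> patch would show as missing
```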
Don's Analysis:
It’s very odd… the timestamps are all pretty close. The patch-controller applies the patch and updates the repo, then the patch-agent queries the repo but somehow seems to have gotten stale metadata. I think this is the first time I’ve wished I had sub-second precision logging in patching.log.
Sequence on patch-apply:
• controller:
    o copy the rpm to the updates repo
    o run createrepo to update the repo metadata (repomd.xml modified at 02:22:22.449000000)
    o increment patch_op_counter as part of the “sync” request; this counter is included in HELLO messages (.controller.state file modified at 02:22:22.466000000)
• agent:
    o receives the HELLO message
    o if patch_op_counter is greater than the cached counter value:
        - run “dnf makecache” to ensure local dnf data is updated
        - perform a new query
        - send results back to the patch-controller
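The agent-side steps above can be sketched roughly like this (illustrative only, not the real patch-agent code; the command runner is injectable so the dnf call can be stubbed out):

```python
import subprocess

class PatchAgentSketch:
    """Rough model of the HELLO handling described above."""

    def __init__(self, run_cmd=subprocess.run):
        self._run = run_cmd          # injectable so tests need not run dnf
        self.cached_counter = 0

    def handle_hello(self, hello):
        # Only re-query when the controller reports a newer patch operation.
        if hello["patch_op_counter"] <= self.cached_counter:
            return False
        # Refresh local dnf metadata first; if the repo server hands back a
        # stale repomd.xml here, everything downstream sees old data.
        self._run(["dnf", "makecache"], check=True)
        self.cached_counter = hello["patch_op_counter"]
        return True  # caller would now query and report back to the controller
```

Note that correctness hinges on the makecache step actually fetching the new repomd.xml; the agent has no way to tell a fresh 200 response from a stale one.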
In patching.log, I see the repo update and the “sync” request, then the patch-agent logs show it performing a query. I also see the repomd.xml requests in /www/var/log/lighttpd-access.log (two, actually, right together: the first presumably from the makecache, the second from the new query).
HOWEVER…. see these logs:
fd01:136::2 controller:8080 - [17/Jun/2020:23:13:34 +0000] "GET /updates/rel-20.06/repodata/repomd.xml HTTP/1.1" 200 3661 "-" "libdnf"
fd01:136::2 controller:8080 - [17/Jun/2020:23:17:20 +0000] "GET /updates/rel-20.06/repodata/repomd.xml HTTP/1.1" 200 3661 "-" "libdnf"
fd01:136::2 controller:8080 - [18/Jun/2020:02:17:23 +0000] "GET /updates/rel-20.06/repodata/repomd.xml HTTP/1.1" 200 3661 "-" "libdnf"
fd01:136::2 controller:8080 - [18/Jun/2020:02:22:22 +0000] "GET /updates/rel-20.06/repodata/repomd.xml HTTP/1.1" 200 3661 "-" "libdnf"
fd01:136::2 controller:8080 - [18/Jun/2020:02:22:22 +0000] "GET /updates/rel-20.06/repodata/repomd.xml HTTP/1.1" 200 3661 "-" "libdnf"
fd01:136::2 controller:8080 - [18/Jun/2020:05:17:24 +0000] "GET /updates/rel-20.06/repodata/repomd.xml HTTP/1.1" 200 3665 "-" "libdnf"
fd01:136::2 controller:8080 - [18/Jun/2020:05:17:24 +0000] "GET /updates/rel-20.06/repodata/repomd.xml HTTP/1.1" 200 3665 "-" "libdnf"
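A cheap way to confirm this kind of staleness at the moment it happens would be to hash what the server returns against the file on disk. A sketch (the URL and path are taken from the logs above; the helper names are mine):

```python
import hashlib
import urllib.request

def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def served_matches_disk(url: str, path: str) -> bool:
    """True if the bytes the web server returns match the file on disk."""
    with urllib.request.urlopen(url) as resp:
        served = resp.read()
    with open(path, "rb") as f:
        on_disk = f.read()
    return sha256(served) == sha256(on_disk)

# e.g.:
# served_matches_disk(
#     "http://controller:8080/updates/rel-20.06/repodata/repomd.xml",
#     "/www/pages/updates/rel-20.06/repodata/repomd.xml",
# )
```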
That highlighted field is the file size being returned: the two requests at 02:22:22 got 3661 bytes, which shows that the repomd.xml returned to the patch-agent when it did the query was in fact the stale one (the updated file on disk is 3665 bytes):

controller-0:/home/sysadmin# stat /www/pages/updates/rel-20.06/repodata/repomd.xml
File: ‘/www/pages/updates/rel-20.06/repodata/repomd.xml’
Size: 3665 Blocks: 8 IO Block: 4096 regular file
Device: fd04h/64772d Inode: 402773 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2020-06-18 02:22:22.977000000 +0000
Modify: 2020-06-18 02:22:22.449000000 +0000
Change: 2020-06-18 02:22:22.449000000 +0000
Birth: -
Which makes me wonder… is lighttpd caching files, maybe? And did the timing of this subcloud’s query land in that narrow window?
https://redmine.lighttpd.net/projects/lighttpd/wiki/Docs_Performance#stat-cache
https://redmine.lighttpd.net/projects/lighttpd/wiki/Server_stat-cache-engineDetails
The only thing that comes close to making sense to me: we fell into a case where lighttpd had somehow cached the file, and because everything happened so quickly, it returned stale data back to the patch-agent.
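If the stat cache turns out to be the culprit, lighttpd does let you disable it. If I’m reading the Docs_Performance page above correctly, the default "simple" engine caches stat() results and only revalidates them about once per second, which would fit this sub-second window exactly (modify at .449, access at .977, same second). A possible mitigation to trial, at some performance cost, in lighttpd.conf:

```conf
# lighttpd.conf: trade stat-caching performance for freshness
server.stat-cache-engine = "disable"
```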