StarlingX

Bug #1884094
Comment #1

Comment 1 for bug 1884094

Revision history for this message

Nimalini Rasa (nrasa) wrote on 2020-06-18:

controller-0:~# sw-patch query-hosts
Hostname IP Address Patch Current Reboot Required Release State
============ =========== ============= =============== ======= =====
controller-0 fd01:136::3 Yes No 20.06 idle

[root@controller-0 ~(keystone_admin)]# sw-patch show 2020-06-11_20-00-00_RR_ALLNODES
2020-06-11_20-00-00_RR_ALLNODES:
    Release: 20.06
    Patch State: Applied
    RR: Y
    Summary: Patch to /etc/init.d/logmgmt
    Contents:
                    RR_ALLNODES-1.0-1.tis.x86_64.rpm
                    logmgmt-1.0-6.tis.x86_64.rpm
                    logmgmt-wheels-1.0-6.tis.x86_64.rpm

[root@controller-0 ~(keystone_admin)]# rpm -qa | grep logmgmt
logmgmt-1.0-4.tis.x86_64

Don's Analysis:

It’s very odd…. the timestamps are all pretty close, but the patch-controller applies the patch, updating the repo… then the patch-agent queries the repo… but seems to have gotten stale metadata somehow… I think this is the first time I’ve wished I had sub-second precision logging in patching.log 

Sequence on patch-apply:
• controller:
o copy rpm to updates repo
o run createrepo to update repo metadata (repomd.xml modified at 02:22:22.449000000)
o increment patch_op_counter as part of “sync” request, which is included in HELLO messages (.controller.state file modified at 02:22:22.466000000)
• agent:
o receives HELLO message
o if patch_op_counter is greater than cached counter value:
 run “dnf makecache” to ensure local dnf data is updated
 perform new query
 send results back to patch-controller

In patching.log, I see the repo update and the “sync” request, then I see the patch agent logs showing it does a query. I see the repomd.xml query in /www/var/log/lighttpd-access.log (two, actually, right together… the first presumably the makecache, and the second the new query).

HOWEVER…. see these logs:

fd01:136::2 controller:8080 - [17/Jun/2020:23:13:34 +0000] "GET /updates/rel-20.06/repodata/repomd.xml HTTP/1.1" 200 3661 "-" "libdnf"
fd01:136::2 controller:8080 - [17/Jun/2020:23:17:20 +0000] "GET /updates/rel-20.06/repodata/repomd.xml HTTP/1.1" 200 3661 "-" "libdnf"
fd01:136::2 controller:8080 - [18/Jun/2020:02:17:23 +0000] "GET /updates/rel-20.06/repodata/repomd.xml HTTP/1.1" 200 3661 "-" "libdnf"
fd01:136::2 controller:8080 - [18/Jun/2020:02:22:22 +0000] "GET /updates/rel-20.06/repodata/repomd.xml HTTP/1.1" 200 3661 "-" "libdnf"
fd01:136::2 controller:8080 - [18/Jun/2020:02:22:22 +0000] "GET /updates/rel-20.06/repodata/repomd.xml HTTP/1.1" 200 3661 "-" "libdnf"
fd01:136::2 controller:8080 - [18/Jun/2020:05:17:24 +0000] "GET /updates/rel-20.06/repodata/repomd.xml HTTP/1.1" 200 3665 "-" "libdnf"
fd01:136::2 controller:8080 - [18/Jun/2020:05:17:24 +0000] "GET /updates/rel-20.06/repodata/repomd.xml HTTP/1.1" 200 3665 "-" "libdnf"

That highlighted field is the file size being returned… which shows that the repomd.xml returned to the patch-agent when it did the query was in fact the stale one…
controller-0:/home/sysadmin# stat /www/pages/updates/rel-20.06/repodata/repomd.xml
File: ‘/www/pages/updates/rel-20.06/repodata/repomd.xml’
Size: 3665 Blocks: 8 IO Block: 4096 regular file
Device: fd04h/64772d Inode: 402773 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2020-06-18 02:22:22.977000000 +0000
Modify: 2020-06-18 02:22:22.449000000 +0000
Change: 2020-06-18 02:22:22.449000000 +0000
Birth: -

Which makes me wonder… is lighttpd caching files, maybe? And the timing of this subcloud hit in that narrow window?

https://redmine.lighttpd.net/projects/lighttpd/wiki/Docs_Performance#stat-cache
https://redmine.lighttpd.net/projects/lighttpd/wiki/Server_stat-cache-engineDetails

Only thing that is close to making any sense to me… that we fell into a case where lighttpd somehow cached the file, and because everything happened so quickly, it returned stale data back to the patch-agent.

controller-0:~# sw-patch query-hosts
  Hostname    IP Address   Patch Current  Reboot Required  Release  State
============  ===========  =============  ===============  =======  =====
controller-0  fd01:136::3       Yes             No          20.06   idle

[root@controller-0 ~(keystone_admin)]# sw-patch show 2020-06-11_20-00-00_RR_ALLNODES
2020-06-11_20-00-00_RR_ALLNODES:
    Release:        20.06
    Patch State:    Applied
    RR:             Y
    Summary:        Patch to /etc/init.d/logmgmt
    Contents:
                    RR_ALLNODES-1.0-1.tis.x86_64.rpm
                    logmgmt-1.0-6.tis.x86_64.rpm
                    logmgmt-wheels-1.0-6.tis.x86_64.rpm

[root@controller-0 ~(keystone_admin)]# rpm -qa | grep logmgmt
logmgmt-1.0-4.tis.x86_64

Don's Analysis:

Sequence on patch-apply:
•	controller:
o	copy rpm to updates repo
o	run createrepo to update repo metadata (repomd.xml modified at 02:22:22.449000000)
o	increment patch_op_counter as part of “sync” request, which is included in HELLO messages (.controller.state file modified at 02:22:22.466000000)
•	agent:
o	receives HELLO message
o	if patch_op_counter is greater than cached counter value:
	run “dnf makecache” to ensure local dnf data is updated
	perform new query
	send results back to patch-controller

HOWEVER…. see these logs:

That highlighted field is the file size being returned… which shows that the repomd.xml returned to the patch-agent when it did the query was in fact the stale one… 
controller-0:/home/sysadmin# stat /www/pages/updates/rel-20.06/repodata/repomd.xml
  File: ‘/www/pages/updates/rel-20.06/repodata/repomd.xml’
  Size: 3665            Blocks: 8          IO Block: 4096   regular file
Device: fd04h/64772d    Inode: 402773      Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2020-06-18 02:22:22.977000000 +0000
Modify: 2020-06-18 02:22:22.449000000 +0000
Change: 2020-06-18 02:22:22.449000000 +0000
Birth: -

Which makes me wonder… is lighttpd caching files, maybe? And the timing of this subcloud hit in that narrow window?

https://redmine.lighttpd.net/projects/lighttpd/wiki/Docs_Performance#stat-cache
https://redmine.lighttpd.net/projects/lighttpd/wiki/Server_stat-cache-engineDetails