200.006 controller-0 is degraded due to the failure of its 'sw-patch-agent' process. Auto recovery of this major process is in progress.

Bug #1955076 reported by Alexandru Dimofte
Affects: StarlingX
Status: Invalid
Importance: Critical
Assigned to: Alexandru Dimofte

Bug Description

Brief Description
-----------------
All bare metal configurations failed during provisioning. controller-0 is in a degraded state and this alarm is visible:
200.006 controller-0 is degraded due to the failure of its 'sw-patch-agent' process. Auto recovery of this major process is in progress.
The virtual configurations were not affected.

Severity
--------
<Critical: System/Feature is not usable due to the defect>

Steps to Reproduce
------------------
Try to install the stx 6.0 image 20211215T131045Z

Expected Behavior
------------------
Installation should work fine.

Actual Behavior
----------------
During provisioning, the stx install fails and controller-0 availability is degraded.

Reproducibility
---------------
100% reproducible on bare metal machine

System Configuration
--------------------
One node system, Two node system, Multi-node system, Dedicated storage

Branch/Pull Time/Commit
-----------------------
RC 6.0 20211215T131045Z

Last Pass
---------
This is the first RC 6.0 build. The master branch works fine.

Timestamp/Logs
--------------
Will be attached

Test Activity
-------------
Sanity

Workaround
----------
-

Ghada Khalil (gkhalil) wrote :

Screening: Marking as stx.6.0 / critical; issue causing a red sanity for the stx.6.0 rc build. Issue is not seen in master branch builds

Changed in starlingx:
importance: Undecided → Critical
tags: added: stx.6.0 stx.update
Al Bailey (albailey1974) wrote :

This is the stacktrace
2021-12-16T20:24:04: sw-patch-agent[2458087]: patch_functions.py(68): ERROR: Uncaught exception
Traceback (most recent call last):
  File "/usr/sbin/sw-patch-agent", line 15, in <module>
    main()
  File "/usr/lib64/python2.7/site-packages/cgcs_patch/patch_agent.py", line 924, in main
    pa.query()
  File "/usr/lib64/python2.7/site-packages/cgcs_patch/patch_agent.py", line 541, in query
    for pkg in pkggrp.packages_iter():
AttributeError: 'NoneType' object has no attribute 'packages_iter'

Now I just need to determine why pkggrp is None

Al Bailey (albailey1974) wrote :

The patch agent already checks whether the groups can be queried, and deliberately allows the exception to occur when they cannot:
https://github.com/starlingx/update/blob/master/cgcs-patch/cgcs-patch/cgcs_patch/patch_agent.py#L539

So the root issue is the missing groups. Still digging.
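
The failure mode in the traceback can be reproduced with a minimal, self-contained sketch. This is not the actual patch_agent.py code: FakeGroup, FakeBase, find_group and query_group_packages are hypothetical stand-ins. It assumes only what the traceback shows, namely that the group lookup can yield None and that the result is then iterated with packages_iter().

```python
class FakeGroup:
    """Hypothetical stand-in for a package group object."""

    def __init__(self, names):
        self._names = list(names)

    def packages_iter(self):
        return iter(self._names)


class FakeBase:
    """Hypothetical stand-in for the layer that looks up package groups."""

    def __init__(self, groups):
        self._groups = groups

    def find_group(self, name):
        # Returns None when the group metadata could not be loaded
        return self._groups.get(name)


def query_group_packages(base, name):
    """Return the package names in a group, failing loudly if it is missing."""
    pkggrp = base.find_group(name)
    if pkggrp is None:
        # Without this check the loop below raises
        # AttributeError: 'NoneType' object has no attribute 'packages_iter'
        raise RuntimeError(
            "package group %r not found; check that every repo loaded" % name
        )
    return [pkg for pkg in pkggrp.packages_iter()]
```

A guard like this only makes the failure easier to read. It cannot repair a repo that failed to load.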

Ghada Khalil (gkhalil)
Changed in starlingx:
status: New → In Progress
assignee: nobody → Al Bailey (albailey1974)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to update (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/update/+/822197

Al Bailey (albailey1974) wrote :

The group is not found because one of the two repos fails to load.

In that hardware env the feed location is never accessible, whereas the updates one is.

Example logs:
http://controller:8080/feed/rel-21.12/repodata/repomd.xml
2021-12-16T20:24:39Z DEBUG check_transfer_statuses: Transfer finished: repodata/repomd.xml (Effective url: http://controller:8080/feed/rel-21.12/repodata/repomd.xml)
2021-12-16T20:24:39Z DEBUG check_transfer_statuses: Error during transfer: Status code: 404 for http://controller:8080/feed/rel-21.12/repodata/repomd.xml
2021-12-16T20:24:39Z DEBUG check_transfer_statuses: Ignore error - Try another mirror
2021-12-16T20:24:39Z DEBUG select_next_target: Selecting mirror for: repodata/repomd.xml
2021-12-16T20:24:39Z DEBUG select_suitable_mirror: All mirrors were tried without success
2021-12-16T20:24:39Z DEBUG lr_download: Error while downloading: Cannot download repodata/repomd.xml: All mirrors were tried
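
The 404 in these logs can be checked independently of the patch agent. The following is a rough sketch using only the Python standard library; repomd_status is a hypothetical helper name, and the opener parameter exists only so the function can be exercised without a live controller.

```python
import urllib.error
import urllib.request


def repomd_status(base_url, opener=urllib.request.urlopen, timeout=5):
    """Return the HTTP status for <base_url>/repodata/repomd.xml.

    Returns None when the server cannot be reached at all.
    """
    url = base_url.rstrip("/") + "/repodata/repomd.xml"
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # HTTPError must be caught before URLError, which is its base class
        return err.code
    except urllib.error.URLError:
        # Host unreachable, connection refused, and similar
        return None
```

Calling repomd_status("http://controller:8080/feed/rel-21.12") on an affected system would be expected to return 404, matching the log above.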

Al Bailey (albailey1974) wrote :

The missing or invalid repo is /www/pages/feed/rel-21.12
That gets set up during installation (before running ansible etc.).
This sounds like some sort of build issue.

Ghada Khalil (gkhalil) wrote :

There was indeed an issue in the setup/builds for the r/stx.6.0 branch which has been recently addressed:
https://review.opendev.org/c/starlingx/tools/+/822350

The fix is available in http://mirror.starlingx.cengn.ca/mirror/starlingx/rc/6.0/centos/flock/20211221T231755Z/

So we'll monitor the sanity with the above build and update this issue accordingly.

Ghada Khalil (gkhalil) wrote :

Based on Al Bailey's comment in https://bugs.launchpad.net/starlingx/+bug/1955329, sanity is now hitting the repo setup issue reported here.
See the comments and log attachments from Alexandru Dimofte on 2021-12-22 in LP# 1955329

Ghada Khalil (gkhalil) wrote :

Confirmed with Alexandru Dimofte today that the latest attempted sanity passes on virtual env, but not on baremetal.

Alexandru Dimofte (adimofte) wrote :

Oh, yes Ghada, I confirm this, as I already said in the bug description, the virtual configurations were not affected by this issue. It is visible only on bare-metal configurations.

Al Bailey (albailey1974) wrote :

There is an unusual log which indicates that the Packages directory is staged in the wrong location:

2021-12-16T15:07:50.000 localhost sudo: notice sysadmin : TTY=pts/1 ; PWD=/home/sysadmin ; USER=root ; COMMAND=/usr/bin/mv /home/sysadmin//Packages /var/www/pages/feed/rel-21.12/Packages

The /var/www/pages change is part of the Debian conversion, but the STX 6.0 branch does not have these changes,
so the feed should still be located at /www/pages/feed/rel-21.12/Packages

Al Bailey (albailey1974) wrote :

Alexandru,
 The Debian conversion changes for www merged around Dec 15
https://review.opendev.org/q/topic:%22var-www%22

Those changes are not in the stx/6.0 branch.

However, it looks like some automation commands are running sudo to mv the Packages to the Debian location (/var/www/pages/feed/rel-21.12/Packages), while that branch still needs to find them under
/www/pages/feed/rel-21.12

Can you confirm whether the sanity/automation/pxeboot scripts were changed to accommodate the changes in master, thus breaking this env?

This might also explain why virtualbox works, since it is likely booting from the ISO, rather than trying to use a feed.

Al Bailey (albailey1974) wrote :

Also note, the latest stx 6.0 CENGN load did successfully start up patching on a hardware lab.
I think this needs to be assigned back to Alexandru. I believe the issue is the sanity setup.

Alexandru Dimofte (adimofte) wrote :

The automation scripts on our side have been the same for a long time. We didn't change anything in the setup or provisioning scripts. I understand that some changes for www were merged on Dec 15, but do the helmcharts need to be generated for these changes? The helmcharts from the latest stx6.0 image are also from Dec 15.

Alexandru Dimofte (adimofte) wrote :

And the setup is working fine, the issue appears during provisioning...

Alexandru Dimofte (adimofte) wrote :

And from this list https://review.opendev.org/q/topic:%22var-www%22 some changes were merged on Dec 16 or even Dec 21.

Al Bailey (albailey1974) wrote :

The reason this is a setup vs provisioning issue is that before ansible or any additional setup is invoked, there should be a /www/pages/feed/rel-21.12/ directory.

I need someone to verify the directory exists and what files it contains.
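
A check like this can be scripted. The following is a rough sketch, with describe_feed as a hypothetical helper name; it assumes the feed is a repo directory that should contain a Packages directory and repodata/repomd.xml, as the paths in this report suggest.

```python
import os


def describe_feed(feed_root):
    """Report whether a feed directory exists and holds the repo content."""
    exists = os.path.isdir(feed_root)
    return {
        "exists": exists,
        "has_packages": os.path.isdir(os.path.join(feed_root, "Packages")),
        "has_repomd": os.path.isfile(
            os.path.join(feed_root, "repodata", "repomd.xml")
        ),
        # Directory listing, empty when the directory itself is missing
        "entries": sorted(os.listdir(feed_root)) if exists else [],
    }
```

Running describe_feed on both /www/pages/feed/rel-21.12 and /var/www/pages/feed/rel-21.12 would show which location, if either, holds the repo content.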

The collect in 1955329 shows that build has the fix for the layers
https://bugs.launchpad.net/starlingx/+bug/1955329/+attachment/5549144/+files/ALL_NODES_20211222.150052.tar

However: the bash log indicates that the setup routine believes the feed is located in an alternative location

2021-12-22T13:26:06.000 localhost -sh: info HISTORY: PID=149134 UID=42425 sudo -k mv /home/sysadmin//Packages /var/www/pages/feed/rel-21.12/Packages
2021-12-22T13:26:06.000 localhost -sh: info HISTORY: PID=149134 UID=42425 echo $?
2021-12-22T13:26:06.000 localhost -sh: info HISTORY: PID=149134 UID=42425 sudo -k mv /home/sysadmin//repodata /var/www/pages/feed/rel-21.12/repodata
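
Lines like these can be pulled out of a collected bash.log automatically. This is a sketch only: MV_RE and find_mv_targets are hypothetical names, and the pattern assumes the HISTORY line format shown above, with the mv source and destination as the last two whitespace-separated fields.

```python
import re

# Matches "mv <src> <dst>" at the end of a bash HISTORY log line,
# whether or not the command is prefixed with sudo and its flags
MV_RE = re.compile(r"HISTORY: .*?\bmv\s+(\S+)\s+(\S+)\s*$")


def find_mv_targets(lines, prefix="/var/www/"):
    """Return (source, destination) pairs for mv commands into prefix."""
    hits = []
    for line in lines:
        match = MV_RE.search(line)
        if match and match.group(2).startswith(prefix):
            hits.append((match.group(1), match.group(2)))
    return hits
```

An empty result for one collect and a non-empty result for another would point to a difference in the setup steps rather than in the load.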

Note: when I look at a bug raised in November, I do not see those lines in its bash.log
https://bugs.launchpad.net/starlingx/+bug/1952400/+attachment/5543469/+files/ALL_NODES_20211126.080704.tar

This is why I am asking whether the steps are different and whether the load is being set up the same way as master.

Alexandru Dimofte (adimofte) wrote :

[sysadmin@controller-0 ~(keystone_admin)]$ cat /etc/build.info | grep BUILD_ID -A6
BUILD_ID="r/stx.6.0"

JOB="STX_6.0_build_layer_flock"
<email address hidden>"
BUILD_NUMBER="5"
BUILD_HOST="starlingx_mirror"
BUILD_DATE="2021-12-21 23:17:55 +0000"
[sysadmin@controller-0 ~(keystone_admin)]$
[sysadmin@controller-0 ~(keystone_admin)]$ ls -al /www/pages/feed/rel-21.12/
total 672
drwxr-xr-x. 3 root root 4096 Jan 6 08:55 .
drwxr-xr-x. 3 root root 4096 Jan 6 08:53 ..
-r--r--r--. 1 root root 39217 Dec 21 23:36 controller_ks.cfg
-rw-r--r--. 1 root root 37 Jan 6 08:55 install_uuid
drwxr-xr-x. 2 root root 4096 Jan 6 08:53 LiveOS
-r--r--r--. 1 root root 49817 Dec 21 23:36 miniboot_controller_ks.cfg
-r--r--r--. 1 root root 56705 Dec 21 23:36 miniboot_smallsystem_ks.cfg
-r--r--r--. 1 root root 56706 Dec 21 23:36 miniboot_smallsystem_lowlatency_ks.cfg
-r--r--r--. 1 root root 43194 Dec 21 23:36 net_controller_ks.cfg
-r--r--r--. 1 root root 50082 Dec 21 23:36 net_smallsystem_ks.cfg
-r--r--r--. 1 root root 50083 Dec 21 23:36 net_smallsystem_lowlatency_ks.cfg
-r--r--r--. 1 root root 35700 Dec 21 23:36 net_storage_ks.cfg
-r--r--r--. 1 root root 38317 Dec 21 23:36 net_worker_ks.cfg
-r--r--r--. 1 root root 38318 Dec 21 23:36 net_worker_lowlatency_ks.cfg
-r--r--r--. 1 root root 40712 Dec 21 23:36 prestaged_installer_ks.cfg
-r--r--r--. 1 root root 46105 Dec 21 23:36 smallsystem_ks.cfg
-r--r--r--. 1 root root 46106 Dec 21 23:36 smallsystem_lowlatency_ks.cfg
[sysadmin@controller-0 ~(keystone_admin)]$
[sysadmin@controller-0 ~(keystone_admin)]$ ls -al /var/www/pages/feed/rel-21.12/Packages
ls: cannot access /var/www/pages/feed/rel-21.12/Packages: No such file or directory
[sysadmin@controller-0 ~(keystone_admin)]$
[sysadmin@controller-0 ~(keystone_admin)]$ ls -al /var/www/pages/
ls: cannot access /var/www/pages/: No such file or directory
[sysadmin@controller-0 ~(keystone_admin)]$

Austin Sun (sunausti) wrote :

Just went through the history: it seems the rel-21.12/repodata/ folder is missing on controller-0. How are these files generated? Why is this folder missing?
@Alex, is controller-0 installed via pxe?

Alexandru Dimofte (adimofte) wrote :

I had a sync with Austin, and he observed that on the jumpserver the setup scripts were taken from the master branch while the image was from stx6.0. I searched and discovered that the groovy scripts on our Jenkins triggered the run using the master branch by default. I adapted the groovy scripts and triggered the sanity again using the latest stx6.0 image from today. I will send the results later and comment here if the issue disappears.

Austin Sun (sunausti) wrote :

The suspected cause is that the auto test script is using the master branch, which includes the https://review.opendev.org/c/starlingx/test/+/810025 changes; this causes the mv of repodata and Packages from the home folder to /www/pages/ to fail. As synced with Alexandru, he will use the stx.6.0 branch for the test script and trigger a new test to confirm it.
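
A branch mismatch of this kind could be caught before provisioning starts. The sketch below is illustrative only: read_build_id, normalize and branches_match are hypothetical helpers, and treating "r/stx.6.0" and "stx.6.0" as the same branch is an assumption about naming, not something confirmed in this report.

```python
import re


def read_build_id(build_info_text):
    """Extract the BUILD_ID value from the contents of /etc/build.info."""
    match = re.search(r'^BUILD_ID="([^"]*)"', build_info_text, re.MULTILINE)
    return match.group(1) if match else None


def normalize(branch):
    # Assumption: "r/stx.6.0" and "stx.6.0" name the same release branch
    return branch[2:] if branch.startswith("r/") else branch


def branches_match(build_id, script_branch):
    """Compare the branch of the load with the branch of the test scripts."""
    if build_id is None:
        return False
    return normalize(build_id) == normalize(script_branch)
```

With the build.info shown earlier in this report, branches_match(read_build_id(text), "master") is False, which is the situation described here.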

Ghada Khalil (gkhalil) wrote :

Assigning to Alex as he has the next action for this LP

Changed in starlingx:
assignee: Al Bailey (albailey1974) → Alexandru Dimofte (adimofte)
Alexandru Dimofte (adimofte) wrote :

This issue is no longer observed. The provisioning step passed on all configurations. We can close this bug. Thanks!

Ghada Khalil (gkhalil) wrote :

Closing; issue was with the sanity environment, not the software

Changed in starlingx:
status: In Progress → Invalid
OpenStack Infra (hudson-openstack) wrote : Change abandoned on update (master)

Change abandoned by "Al Bailey <email address hidden>" on branch: master
Review: https://review.opendev.org/c/starlingx/update/+/822197
Reason: The patch agent cannot work when the pxeboot feed is missing. This change cannot address that, it would merely make the log cleaner. Better to leave it messy, as there is nothing patching can do to resolve the issue
