stx-openstack apply failed after enabling cpu_dedicated_set

Bug #2002157 reported by OpenInfra
This bug affects 1 person

Affects: StarlingX
Status: New
Importance: Medium
Assigned to: Unassigned

Bug Description

STX release 7.0, Dedicated Storage configuration with multiple worker nodes.
One of the worker nodes, hosting a couple of VMs, was chosen for enabling cpu_dedicated_set via controller-0:
1. Shut down the guests/VMs, then locked the selected worker node.
2. Enabled cpu_dedicated_set on the worker node as per the documentation [1].
3. Unlocked the worker.
Then noticed that the stx-openstack apply process failed.
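
For reference, a minimal sketch of the commands behind those steps, assuming the system host-cpu-modify flow from the linked documentation [1]; the host name and per-processor core counts are illustrative:

# Assumption: dedicated cores are enabled by assigning the "application-isolated"
# function; 4 cores per processor is a placeholder value.
system host-lock worker-ov-01
system host-cpu-modify -f application-isolated -p0 4 -p1 4 worker-ov-01
system host-unlock worker-ov-01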

Here is the armada log: https://paste.opendev.org/show/b4z1jLSPbeEZDXyvFeBV/

A couple of pods were failing (on the same worker node):
NAME                                       READY   STATUS             RESTARTS       AGE
nova-compute-worker-ov-01-cdc7009a-n95w9   1/2     CrashLoopBackOff   18 (94s ago)   72m
pci-irq-affinity-agent-2kgds               0/1     Init:0/1           0              24h

Deleted both pods and checked whether they would recover, but no luck.
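
A sketch of that recovery attempt, assuming the pods live in the openstack namespace:

# Delete the failing pods and watch whether their replacements come up healthy
kubectl -n openstack delete pod nova-compute-worker-ov-01-cdc7009a-n95w9
kubectl -n openstack delete pod pci-irq-affinity-agent-2kgds
kubectl -n openstack get pods -o wide | grep worker-ov-01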

Collect logs are available at [2].

pci-irq-affinity-agent-2kgds pod log:
Error from server (BadRequest): container "pci-irq-affinity-agent" in pod "pci-irq-affinity-agent-2kgds" is waiting to start: PodInitializing
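
That BadRequest only says the main container has not started yet; the blocker is the init container. A hedged sketch of how to dig further (the init container name is queried rather than assumed):

# List the pod's init containers, then pull their logs and events
kubectl -n openstack get pod pci-irq-affinity-agent-2kgds -o jsonpath='{.spec.initContainers[*].name}'
kubectl -n openstack logs pci-irq-affinity-agent-2kgds -c <init-container-name>
kubectl -n openstack describe pod pci-irq-affinity-agent-2kgds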

[1] https://docs.starlingx.io/admintasks/openstack/configure-dedicated-and-shared-cpu-pools-on-hosts.html

[2] https://drive.google.com/drive/folders/1YxuWOlkoHBsdg9UUuJ0ozZ2maxXcWzX-?usp=sharing

Revision history for this message
OpenInfra (openinfra) wrote :

Further, on the same worker node an Nvidia A40 card had been virtualized (using SR-IOV) and attached to both running guests.
Both guests were working fine.
cpu_dedicated_set was enabled after vGPU creation.

tags: added: stx.7.0 stx.distro.openstack
Changed in starlingx:
assignee: nobody → Thales Elero Cervi (tcervi)
Revision history for this message
OpenInfra (openinfra) wrote :

01. Locked the host.
02. Reverted CPU pinning (no application-isolated cores).
03. stx-openstack apply completed.
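
A hedged sketch of that revert, assuming it is the same system host-cpu-modify flow with the application-isolated core count set back to zero (host name illustrative):

system host-lock worker-ov-01
system host-cpu-modify -f application-isolated -p0 0 -p1 0 worker-ov-01
system host-unlock worker-ov-01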

Revision history for this message
OpenInfra (openinfra) wrote :

+--------------------------+---------------------------------+-------------------------------------------+--------------------+----------+-----------+
| application | version | manifest name | manifest file | status | progress |
+--------------------------+---------------------------------+-------------------------------------------+--------------------+----------+-----------+
| cert-manager | 1.0-37 | cert-manager-fluxcd-manifests | fluxcd-manifests | applied | completed |
| nginx-ingress-controller | 1.1-38 | nginx-ingress-controller-fluxcd-manifests | fluxcd-manifests | applied | completed |
| oidc-auth-apps | 1.0-69 | oidc-auth-apps-fluxcd-manifests | fluxcd-manifests | uploaded | completed |
| platform-integ-apps | 1.0-53 | platform-integ-apps-fluxcd-manifests | fluxcd-manifests | applied | completed |
| rook-ceph-apps | 1.0-17 | rook-ceph-manifest | manifest.yaml | uploaded | completed |
| stx-openstack | 1.0-205-centos-stable-versioned | openstack-manifest | stx-openstack.yaml | applied | completed |
+--------------------------+---------------------------------+-------------------------------------------+--------------------+----------+-----------+

Revision history for this message
OpenInfra (openinfra) wrote :

nova-compute and pci-irq-affinity-agent are still failing on the same node.

NAME                                       READY   STATUS             RESTARTS        AGE   IP                NODE           NOMINATED NODE   READINESS GATES
nova-compute-worker-ov-01-cdc7009a-crb2h   1/2     CrashLoopBackOff   8 (3m23s ago)   23m   192.168.204.147   worker-ov-01   <none>           <none>
pci-irq-affinity-agent-rh5cq               0/1     Init:0/1           0               23m   172.16.231.222    worker-ov-01   <none>
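
For reference, a node-scoped listing like the one above can be produced with a field selector (namespace and node name as assumed earlier):

kubectl -n openstack get pods -o wide --field-selector spec.nodeName=worker-ov-01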

Revision history for this message
Thales Elero Cervi (tcervi) wrote :

Is this still an issue with stx.8.0 and/or stx.9.0?

Changed in starlingx:
assignee: Thales Elero Cervi (tcervi) → nobody
Changed in starlingx:
importance: Undecided → Medium