stx-openstack apply failed after enabling cpu_dedicated_set

Bug #2002157 reported by OpenInfra
This bug affects 1 person

Affects: StarlingX
Status: New
Importance: Medium
Assigned to: Unassigned

Bug Description

STX release 7.0, Dedicated Storage configuration with multiple worker nodes.
One of the worker nodes, hosting a couple of VMs, was chosen for enabling cpu_dedicated_set via controller-0:
1. Shut down the guests/VMs, then locked the selected worker node.
2. Enabled cpu_dedicated_set on the worker node as per the documentation [1].
3. Unlocked the worker.
Then noticed that the stx-openstack apply process failed.
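
For reference, a minimal sketch of the commands behind those steps, assuming the system host-cpu-modify flow from the linked documentation [1]; the host name and per-processor core counts are illustrative:

# Assumption: dedicated cores are enabled by assigning the "application-isolated"
# function; 4 cores per processor is a placeholder value.
system host-lock worker-ov-01
system host-cpu-modify -f application-isolated -p0 4 -p1 4 worker-ov-01
system host-unlock worker-ov-01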

Here is the armada log: https://paste.opendev.org/show/b4z1jLSPbeEZDXyvFeBV/

A couple of pods were failing (on the same worker node):
NAME                                       READY   STATUS             RESTARTS       AGE
nova-compute-worker-ov-01-cdc7009a-n95w9   1/2     CrashLoopBackOff   18 (94s ago)   72m
pci-irq-affinity-agent-2kgds               0/1     Init:0/1           0              24h

Deleted both pods and checked whether they would recover, but no luck.
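
A sketch of that recovery attempt, assuming the pods live in the openstack namespace:

# Delete the failing pods and watch whether their replacements come up healthy
kubectl -n openstack delete pod nova-compute-worker-ov-01-cdc7009a-n95w9
kubectl -n openstack delete pod pci-irq-affinity-agent-2kgds
kubectl -n openstack get pods -o wide | grep worker-ov-01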

Collect logs are available at [2].

pci-irq-affinity-agent-2kgds pod log:
Error from server (BadRequest): container "pci-irq-affinity-agent" in pod "pci-irq-affinity-agent-2kgds" is waiting to start: PodInitializing
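
That BadRequest only says the main container has not started yet; the blocker is the init container. A hedged sketch of how to dig further (the init container name is queried rather than assumed):

# List the pod's init containers, then pull their logs and events
kubectl -n openstack get pod pci-irq-affinity-agent-2kgds -o jsonpath='{.spec.initContainers[*].name}'
kubectl -n openstack logs pci-irq-affinity-agent-2kgds -c <init-container-name>
kubectl -n openstack describe pod pci-irq-affinity-agent-2kgds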

[1] https://docs.starlingx.io/admintasks/openstack/configure-dedicated-and-shared-cpu-pools-on-hosts.html

[2] https://drive.google.com/drive/folders/1YxuWOlkoHBsdg9UUuJ0ozZ2maxXcWzX-?usp=sharing

Revision history for this message
OpenInfra (openinfra) wrote :

Further, on the same worker node an Nvidia A40 card had been virtualized (using SR-IOV) and attached to both running guests.
Both guests were working fine.
cpu_dedicated_set was enabled after vGPU creation.

tags: added: stx.7.0 stx.distro.openstack
Changed in starlingx:
assignee: nobody → Thales Elero Cervi (tcervi)
Revision history for this message
OpenInfra (openinfra) wrote :

01. Locked the host.
02. Reverted CPU pinning (no application-isolated cores).
03. stx-openstack apply completed.
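
A hedged sketch of that revert, assuming it is the same system host-cpu-modify flow with the application-isolated core count set back to zero (host name illustrative):

system host-lock worker-ov-01
system host-cpu-modify -f application-isolated -p0 0 -p1 0 worker-ov-01
system host-unlock worker-ov-01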

Revision history for this message
OpenInfra (openinfra) wrote :

+--------------------------+---------------------------------+-------------------------------------------+--------------------+----------+-----------+
| application | version | manifest name | manifest file | status | progress |
+--------------------------+---------------------------------+-------------------------------------------+--------------------+----------+-----------+
| cert-manager | 1.0-37 | cert-manager-fluxcd-manifests | fluxcd-manifests | applied | completed |
| nginx-ingress-controller | 1.1-38 | nginx-ingress-controller-fluxcd-manifests | fluxcd-manifests | applied | completed |
| oidc-auth-apps | 1.0-69 | oidc-auth-apps-fluxcd-manifests | fluxcd-manifests | uploaded | completed |
| platform-integ-apps | 1.0-53 | platform-integ-apps-fluxcd-manifests | fluxcd-manifests | applied | completed |
| rook-ceph-apps | 1.0-17 | rook-ceph-manifest | manifest.yaml | uploaded | completed |
| stx-openstack | 1.0-205-centos-stable-versioned | openstack-manifest | stx-openstack.yaml | applied | completed |
+--------------------------+---------------------------------+-------------------------------------------+--------------------+----------+-----------+

Revision history for this message
OpenInfra (openinfra) wrote :

nova-compute and pci-irq-affinity-agent are still failing on the same node.

NAME                                       READY   STATUS             RESTARTS        AGE   IP                NODE           NOMINATED NODE   READINESS GATES
nova-compute-worker-ov-01-cdc7009a-crb2h   1/2     CrashLoopBackOff   8 (3m23s ago)   23m   192.168.204.147   worker-ov-01   <none>           <none>
pci-irq-affinity-agent-rh5cq               0/1     Init:0/1           0               23m   172.16.231.222    worker-ov-01   <none>
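
For reference, a node-scoped listing like the one above can be produced with a field selector (namespace and node name as assumed earlier):

kubectl -n openstack get pods -o wide --field-selector spec.nodeName=worker-ov-01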

Revision history for this message
Thales Elero Cervi (tcervi) wrote :

Is this still an issue with stx.8.0 and/or stx.9.0?

Changed in starlingx:
assignee: Thales Elero Cervi (tcervi) → nobody
Changed in starlingx:
importance: Undecided → Medium