stx-openstack apply is stuck at 53%. Error while installing release osh-openstack-pci-irq-affinity-agent: grpc._channel._InactiveRpcError

Bug #1951245 reported by Alexandru Dimofte
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Heitor Matsui
Milestone: -

Bug Description

Brief Description
-----------------
stx-openstack apply is stuck at 53%. Error while installing release osh-openstack-pci-irq-affinity-agent: grpc._channel._InactiveRpcError <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNKNOWN
        details = "release osh-openstack-pci-irq-affinity-agent failed: timed out waiting for the condition"

Severity
--------
Critical: System/Feature is not usable due to the defect

Steps to Reproduce
------------------
Try to install StarlingX

Expected Behavior
------------------
The installation should complete successfully.

Actual Behavior
----------------
stx-openstack apply is stuck at 53%.

Reproducibility
---------------
100% reproducible

System Configuration
--------------------
So far this has been observed on bare metal: Duplex, Standard, Standard External

Branch/Pull Time/Commit
-----------------------
20211117T032111Z

Last Pass
---------
20211116T021917Z

Timestamp/Logs
--------------
Logs will be attached

Test Activity
-------------
Sanity

Workaround
----------
-

Revision history for this message
Alexandru Dimofte (adimofte) wrote :

These logs were extracted from the bare metal Standard configuration.

Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Thiago Paiva Brito (outbrito)
importance: Undecided → High
status: New → Triaged
tags: added: stx.6.0 stx.distro.openstack
Revision history for this message
Thiago Paiva Brito (outbrito) wrote :

This looks very similar to https://review.opendev.org/c/starlingx/vault-armada-app/+/772616 and we are indeed missing the imagePullSecrets on the new pci-irq-affinity-agent chart.

Maybe this was overlooked because of local tests and local image builds...

We are testing and will open a change in a bit.

Changed in starlingx:
assignee: Thiago Paiva Brito (outbrito) → nobody
Revision history for this message
Heitor Matsui (heitormatsui) wrote :

Looking further into the problem, the pci-irq-affinity-agent pod is failing with a permission-denied error while trying to pull the image:

  Warning Failed 23s (x2 over 37s) kubelet Failed to pull image "registry.local:9001/docker.io/starlingx/stx-pci-irq-affinity-agent:master-centos-stable-20211116T032946Z.0": rpc error: code = Unknown desc = failed to pull and unpack image "registry.local:9001/docker.io/starlingx/stx-pci-irq-affinity-agent:master-centos-stable-20211116T032946Z.0": failed to resolve reference "registry.local:9001/docker.io/starlingx/stx-pci-irq-affinity-agent:master-centos-stable-20211116T032946Z.0": pull access denied, repository does not exist or may require authorization: server message: insufficient_scope: authorization failed
  Warning Failed 23s (x2 over 37s) kubelet Error: ErrImagePull
  Normal BackOff 12s (x3 over 37s) kubelet Back-off pulling image "registry.local:9001/docker.io/starlingx/stx-pci-irq-affinity-agent:master-centos-stable-20211116T032946Z.0"
  Warning Failed 12s (x3 over 37s) kubelet Error: ImagePullBackOff

That was caused by the missing imagePullSecrets mentioned before.

Changed in starlingx:
assignee: nobody → Heitor Matsui (heitormatsui)
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-armada-app (master)
Changed in starlingx:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-armada-app (master)

Reviewed: https://review.opendev.org/c/starlingx/openstack-armada-app/+/818256
Committed: https://opendev.org/starlingx/openstack-armada-app/commit/1747cd2dd20c2e1cf6c6d35733f19e669faf2b8d
Submitter: "Zuul (22348)"
Branch: master

commit 1747cd2dd20c2e1cf6c6d35733f19e669faf2b8d
Author: Heitor Matsui <email address hidden>
Date: Wed Nov 17 14:53:18 2021 -0300

    Create ServiceAccount for pci-irq-affinity-agent

    The stx-openstack apply is failing with a timeout condition,
    caused by failure while trying to pull the image of the agent from
    the local registry. This is caused by missing imagePullSecrets on
    the daemonset definition, so this commit creates the ServiceAccount
    with the imagePullSecrets and use it on the spec.

    Closes-Bug: 1951245
    Change-Id: I5e408539e001bbe867812db8c614d47cea1ae225
    Signed-off-by: Heitor Matsui <email address hidden>
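
For reference, a minimal sketch of the shape of this fix, assuming a secret named default-registry-key holds the local registry credentials (resource and secret names here are illustrative assumptions, not the exact merged chart): the chart gains a ServiceAccount carrying imagePullSecrets, and the DaemonSet pod spec references it so kubelet can authenticate against registry.local:9001.

# Sketch only: names are assumptions, not the merged chart contents.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: pci-irq-affinity-agent
  namespace: openstack
imagePullSecrets:
  - name: default-registry-key   # assumed secret with the local registry credentials
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: pci-irq-affinity-agent
  namespace: openstack
spec:
  selector:
    matchLabels:
      application: pci-irq-affinity-agent
  template:
    metadata:
      labels:
        application: pci-irq-affinity-agent
    spec:
      serviceAccountName: pci-irq-affinity-agent   # pods inherit the pull secret above
      containers:
        - name: pci-irq-affinity-agent
          image: registry.local:9001/docker.io/starlingx/stx-pci-irq-affinity-agent:master-centos-stable-20211116T032946Z.0

With the ServiceAccount in place, kubelet presents the registry credentials on pull, which avoids the insufficient_scope / pull-access-denied error shown above.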

Changed in starlingx:
status: Fix Committed → Fix Released
Revision history for this message
Alexandru Dimofte (adimofte) wrote :

Today I extracted these events from the Duplex configuration:
  Normal Scheduled 52m default-scheduler Successfully assigned openstack/pci-irq-affinity-agent-prj2q to controller-1
  Warning FailedMount 50m kubelet, controller-1 Unable to attach or mount volumes: unmounted volumes=[keyring], unattached volumes=[libvirt-sock-ro compute-pci-devices compute-irq pci-irq-affinity-agent-bin kube-api-access-8fvh9 pci-irq-affinity-agent-etc keyring]: timed out waiting for the condition
  Warning FailedMount 34m kubelet, controller-1 Unable to attach or mount volumes: unmounted volumes=[keyring], unattached volumes=[kube-api-access-8fvh9 pci-irq-affinity-agent-etc keyring libvirt-sock-ro compute-pci-devices compute-irq pci-irq-affinity-agent-bin]: timed out waiting for the condition
  Warning FailedMount 32m (x3 over 43m) kubelet, controller-1 Unable to attach or mount volumes: unmounted volumes=[keyring], unattached volumes=[compute-pci-devices compute-irq pci-irq-affinity-agent-bin kube-api-access-8fvh9 pci-irq-affinity-agent-etc keyring libvirt-sock-ro]: timed out waiting for the condition
  Warning FailedMount 30m (x4 over 45m) kubelet, controller-1 Unable to attach or mount volumes: unmounted volumes=[keyring], unattached volumes=[compute-irq pci-irq-affinity-agent-bin kube-api-access-8fvh9 pci-irq-affinity-agent-etc keyring libvirt-sock-ro compute-pci-devices]: timed out waiting for the condition
  Warning FailedMount 12m (x3 over 21m) kubelet, controller-1 Unable to attach or mount volumes: unmounted volumes=[keyring], unattached volumes=[pci-irq-affinity-agent-bin kube-api-access-8fvh9 pci-irq-affinity-agent-etc keyring libvirt-sock-ro compute-pci-devices compute-irq]: timed out waiting for the condition
  Warning FailedMount 7m30s (x3 over 48m) kubelet, controller-1 Unable to attach or mount volumes: unmounted volumes=[keyring], unattached volumes=[pci-irq-affinity-agent-etc keyring libvirt-sock-ro compute-pci-devices compute-irq pci-irq-affinity-agent-bin kube-api-access-8fvh9]: timed out waiting for the condition
  Warning FailedMount 95s (x33 over 52m) kubelet, controller-1 MountVolume.SetUp failed for volume "keyring" : hostPath type check failed: /opt/platform/.keyring/21.12/python_keyring/crypted_pass.cfg is not a file

Revision history for this message
Alexandru Dimofte (adimofte) wrote :

On Standard and Standard External I observed this issue:
fm-rest-api ++ sed -nr 's/^sql_connection[[:space:]]?=[[:space:]]?//p' /etc/fm/fm.conf
fm-rest-api + export 'SQL_CONNECTION=mysql+pymysql://fm:T*<email address hidden>:3306/fm'
fm-rest-api + SQL_CONNECTION='mysql+pymysql://fm:T*<email address hidden>:3306/fm'
fm-rest-api + echo 'mysql+pymysql://fm:T*<email address hidden>:3306/fm'
fm-rest-api + python /usr/local/bin/fm_db_sync_event_suppression.py 'mysql+pymysql://fm:T*<email address hidden>:3306/fm'
fm-rest-api Traceback (most recent call last):
fm-rest-api   File "/usr/local/bin/fm_db_sync_event_suppression.py", line 28, in <module>
fm-rest-api     LOG = log.get_logger(__name__)
fm-rest-api   File "/usr/local/bin/fm_log.py", line 23, in get_logger
fm-rest-api     setup_logger(_loggers[name])
fm-rest-api   File "/usr/local/bin/fm_log.py", line 38, in setup_logger
fm-rest-api     facility=syslog_facility)
fm-rest-api   File "/usr/lib64/python2.7/logging/handlers.py", line 761, in __init__
fm-rest-api     self._connect_unixsocket(address)
fm-rest-api   File "/usr/lib64/python2.7/logging/handlers.py", line 789, in _connect_unixsocket
fm-rest-api     self.socket.connect(address)
fm-rest-api   File "/usr/lib64/python2.7/socket.py", line 224, in meth ...


Revision history for this message
Alexandru Dimofte (adimofte) wrote :
Revision history for this message
Heitor Matsui (heitormatsui) wrote :

Can you please attach the duplex logs? From the pci-irq-affinity-agent pod logs on the standard setup, I see the pods were raised successfully:

From compute-0 pci-irq-affinity-agent pod log:
---
2021-11-18T07:33:23.814201633Z stdout F 2021-11-18 07:33:23,813 MainThread[9] pci-interrupt-affinity./var/lib/openstack/lib/python2.7/site-packages/pci_irq_affinity/agent.py.176 - INFO Enter PCIInterruptAffinity Agent
2021-11-18T07:33:23.887842747Z stdout F 2021-11-18 07:33:23,883 MainThread[9] pci-interrupt-affinity./var/lib/openstack/lib/python2.7/site-packages/pci_irq_affinity/agent.py.155 - INFO rabbit://nova-rabbitmq-user:f7jLxgCH!<email address hidden>:5672/nova
2021-11-18T07:33:24.163779518Z stdout F 2021-11-18 07:33:24,147 MainThread[9] pci-interrupt-affinity./var/lib/openstack/lib/python2.7/site-packages/pci_irq_affinity/agent.py.168 - INFO Rabbitmq Client Started!
2021-11-18T07:33:54.184169801Z stderr F No handlers could be found for logger "oslo_messaging.server"
---

compute-1 pci-irq-affinity-agent pod log:
---
2021-11-18T07:33:25.442231377Z stdout F 2021-11-18 07:33:25,442 MainThread[9] pci-interrupt-affinity./var/lib/openstack/lib/python2.7/site-packages/pci_irq_affinity/agent.py.176 - INFO Enter PCIInterruptAffinity Agent
2021-11-18T07:33:25.496219458Z stdout F 2021-11-18 07:33:25,495 MainThread[9] pci-interrupt-affinity./var/lib/openstack/lib/python2.7/site-packages/pci_irq_affinity/agent.py.155 - INFO rabbit://nova-rabbitmq-user:f7jLxgCH!<email address hidden>:5672/nova
2021-11-18T07:33:25.748741984Z stdout F 2021-11-18 07:33:25,731 MainThread[9] pci-interrupt-affinity./var/lib/openstack/lib/python2.7/site-packages/pci_irq_affinity/agent.py.168 - INFO Rabbitmq Client Started!
2021-11-18T07:33:55.758618371Z stderr F No handlers could be found for logger "oslo_messaging.server"
---

Revision history for this message
Alexandru Dimofte (adimofte) wrote :
Revision history for this message
Heitor Matsui (heitormatsui) wrote :

The fm-rest-api issue is not associated with this bug and seems related to this commit on the starlingx/fault repo: 54d7e692

Revision history for this message
Alexandru Dimofte (adimofte) wrote :

OK, then I will open another LP for the fm-rest-api issue.

Revision history for this message
Heitor Matsui (heitormatsui) wrote :

For the duplex logs, we see some evictions due to disk pressure during the apply, but the pci-irq-affinity-agent pod manages to run on controller-0 after a while:

2021-11-18T10:22:17Z pci-irq-affinity-agent-prj2q Pod Successfully assigned openstack/pci-irq-affinity-agent-prj2q to controller-1 Scheduled Normal
2021-11-18T10:22:17Z pci-irq-affinity-agent-nq688 Pod Stopping container pci-irq-affinity-agent Killing Normal
2021-11-18T10:22:17Z pci-irq-affinity-agent-k85d7 Pod Successfully assigned openstack/pci-irq-affinity-agent-k85d7 to controller-0 Scheduled Normal
2021-11-18T10:22:17Z pci-irq-affinity-agent DaemonSet Created pod: pci-irq-affinity-agent-k85d7 SuccessfulCreate Normal
2021-11-18T10:22:17Z pci-irq-affinity-agent-prj2q Pod MountVolume.SetUp failed for volume "keyring" : hostPath type check failed: /opt/platform/.keyring/21.12/python_keyring/crypted_pass.cfg is not a file FailedMount Warning
2021-11-18T10:22:17Z pci-irq-affinity-agent DaemonSet Created pod: pci-irq-affinity-agent-prj2q SuccessfulCreate Normal
2021-11-18T10:22:18Z pci-irq-affinity-agent-k85d7 Pod Add eth0 [172.16.192.111/32] AddedInterface Normal
2021-11-18T10:22:18Z pci-irq-affinity-agent-k85d7 Pod Started container pci-irq-affinity-agent Started Normal
2021-11-18T10:22:18Z pci-irq-affinity-agent-k85d7 Pod Created container pci-irq-affinity-agent Created Normal
2021-11-18T10:22:18Z pci-irq-affinity-agent-k85d7 Pod Container image "registry.local:9001/docker.io/starlingx/stx-pci-irq-affinity-agent:master-centos-stable-20211116T032946Z.0" already present on machine Pulled Normal

What catches my attention here is that on controller-1 the pod fails because it is unable to mount the keyring volume, which the agent needs in order to get the credentials to communicate with the OpenStack services, in this line:

2021-11-18T10:22:17Z pci-irq-affinity-agent-prj2q Pod MountVolume.SetUp failed for volume "keyring" : hostPath type check failed: /opt/platform/.keyring/21.12/python_keyring/crypted_pass.cfg is not a file FailedMount Warning

This missing file may cause problems beyond the stx-openstack apply failure, and it should not be missing on one of the hosts. The image pull failure from yesterday is no longer happening after the fix, since the pod was raised on the standard setup and on one of the controllers in the duplex setup.
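
For context, the "hostPath type check failed" message is what kubelet reports when a hostPath volume declared with type: File points at a path that is not a regular file on that host. A minimal sketch of such a volume definition (the exact chart template may differ):

volumes:
  - name: keyring
    hostPath:
      # With type: File, kubelet verifies a regular file exists at this path
      # before mounting; if crypted_pass.cfg is missing, the mount fails with
      # "hostPath type check failed" and the pod never starts.
      path: /opt/platform/.keyring/21.12/python_keyring/crypted_pass.cfg
      type: File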

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to utilities (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/utilities/+/818620

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-armada-app (master)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to utilities (master)

Reviewed: https://review.opendev.org/c/starlingx/utilities/+/818620
Committed: https://opendev.org/starlingx/utilities/commit/46a3e78c640d583879779d908b8db705f661af9c
Submitter: "Zuul (22348)"
Branch: master

commit 46a3e78c640d583879779d908b8db705f661af9c
Author: Heitor Matsui <email address hidden>
Date: Fri Nov 19 15:22:21 2021 -0300

    Use keyring only if password is not on config file

    This commit enables the agent to get the password from the
    agent config file, using the python keyring service only if
    the password is not available on it.

    Closes-Bug: 1951245
    Signed-off-by: Heitor Matsui <email address hidden>
    Change-Id: I9075c2f40cde9af791d7cb9af8e293d6d61119a5

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-armada-app (master)

Reviewed: https://review.opendev.org/c/starlingx/openstack-armada-app/+/818621
Committed: https://opendev.org/starlingx/openstack-armada-app/commit/2755a004578d0e22dc94bdd3bf683a11b234f0d4
Submitter: "Zuul (22348)"
Branch: master

commit 2755a004578d0e22dc94bdd3bf683a11b234f0d4
Author: Heitor Matsui <email address hidden>
Date: Fri Nov 19 14:47:22 2021 -0300

    Get admin password and use it on config template

    The pci-irq-affinity-agent uses the platform keyring file to
    fetch the admin password to communicate with OpenStack services,
    but now that the agent is raised during the application apply it
    can use the same approach the other helm charts use to get the
    credentials, using it's plugin to capture the information and
    generate the system overrides with the values.

    This commit:
    - Changes the plugin to get the credentials
    - Changes the pci-irq-affinity-agent helm chart to include the
    password on the chart values and use it on the agent config file
    template
    - Adds an init container with dependencies on libvirt and nova
    compute pods, so that the agent pod is only create when those
    are available
    - Removes the keyring mount on the container, which will not be
    needed anymore with the previous changes and is causing failure
    when raising the pod
    - Removes additional keyring tools

    Depends-On: https://review.opendev.org/c/starlingx/utilities/+/818620
    Closes-Bug: 1951245
    Signed-off-by: Heitor Matsui <email address hidden>
    Change-Id: I26f993146b8a17b7602a45f0cd5d983c1d93b0c1
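
A rough sketch of the pod-spec shape those chart changes describe, assuming an openstack-helm style dependency-check init container and an images.tags values layout (both assumptions, not the merged template):

# Sketch only: image references and values keys are assumptions.
spec:
  template:
    spec:
      initContainers:
        - name: init
          # assumed dependency-check image; blocks until the libvirt and
          # nova-compute pods the agent depends on are available
          image: "{{ .Values.images.tags.dep_check }}"
      containers:
        - name: pci-irq-affinity-agent
          image: "{{ .Values.images.tags.pci_irq_affinity_agent }}"
          # the admin password now arrives through the Helm overrides generated
          # by the application plugin and is rendered into the agent config file
          # template, so the keyring hostPath volume and mount are removed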
