neutron-ovs-agent-controller-1 issue: MountVolume.SetUp failed for volume "neutron-etc" : failed to sync secret cache: timed out waiting for the condition

Bug #1958073 reported by Alexandru Dimofte
This bug affects 1 person

Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Thiago Paiva Brito
Milestone: (none)

Bug Description

Brief Description
-----------------
Two Kubernetes pods, neutron-ovs-agent-controller-0 and neutron-ovs-agent-controller-1, are failing with:
MountVolume.SetUp failed for volume "neutron-etc" : failed to sync secret cache: timed out waiting for the condition
I observed this issue on two bare-metal configurations (Duplex and Standard).

Severity
--------
Major: System/Feature is usable but degraded

Steps to Reproduce
------------------
Install the latest image from the master branch (20220116T031725Z), then lock/unlock the controllers and computes.

Expected Behavior
------------------
The neutron-ovs-agent-controller-0-xxx and neutron-ovs-agent-controller-1-xxx pods should be in the "Running" state.

Actual Behavior
----------------
The above pods are failing/crashing.

Events:
  Type     Reason       Age                From                   Message
  ----     ------       ---                ----                   -------
  Normal   Scheduled    3h11m              default-scheduler      Successfully assigned openstack/neutron-ovs-agent-controller-0-937646f6-prprb to controller-0
  Normal   Pulled       3h11m              kubelet, controller-0  Container image "registry.local:9001/quay.io/airshipit/kubernetes-entrypoint:v1.0.0" already present on machine
  Normal   Created      3h11m              kubelet, controller-0  Created container init
  Normal   Started      3h11m              kubelet, controller-0  Started container init
  Normal   Pulled       3h10m              kubelet, controller-0  Container image "registry.local:9001/docker.io/starlingx/stx-neutron:master-centos-stable-20220113T034723Z.0" already present on machine
  Warning  FailedMount  37m (x2 over 37m)  kubelet                MountVolume.SetUp failed for volume "neutron-etc" : failed to sync secret cache: timed out waiting for the condition
  Warning  Failed       36m                kubelet                Error: failed to prepare subPath for volumeMount "neutron-bin" of container "neutron-ovs-agent"
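The "failed to sync secret cache" message points at the Secret backing the neutron-etc volume. As a quick diagnostic, a short Python snippet using the official kubernetes client can confirm that secret exists and is readable. This is only an illustrative sketch, not part of the StarlingX tooling; it assumes (as is typical for openstack-helm charts) that the volume is backed by a Secret named neutron-etc in the openstack namespace:

    # Hedged diagnostic: confirm the Secret assumed to back the "neutron-etc"
    # volume exists in the openstack namespace. The names here are assumptions,
    # not taken from the StarlingX manifests.
    from kubernetes import client, config
    from kubernetes.client.rest import ApiException

    config.load_kube_config()                  # local kubeconfig, e.g. on a controller
    v1 = client.CoreV1Api()
    try:
        secret = v1.read_namespaced_secret("neutron-etc", "openstack")
        print("secret found, keys:", sorted((secret.data or {}).keys()))
    except ApiException as exc:
        if exc.status == 404:
            print("secret neutron-etc not found in namespace openstack")
        else:
            raise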

...
        dump_pods_info(con_ssh=con_ssh)
>       raise exceptions.KubeError(msg)
E       utils.exceptions.KubeError: Kubernetes error.
E       Details: Some pods are not Running or Completed: {'pci-irq-affinity-agent-phf44': 'Init:0/1'}

Reproducibility
---------------
Not yet confirmed whether this is 100% reproducible; likely intermittent.

System Configuration
--------------------
Two-node system (Duplex) and multi-node system (Standard)

Branch/Pull Time/Commit
-----------------------
master 20220116T031725Z

Last Pass
---------
20220113T023728Z

Timestamp/Logs
--------------
Will be attached

Test Activity
-------------
Sanity

Workaround
----------
None identified.

Revision history for this message
Alexandru Dimofte (adimofte) wrote :
Ghada Khalil (gkhalil)
tags: added: stx.distro.openstack
Revision history for this message
Ghada Khalil (gkhalil) wrote (last edit):

Screening: As per Thiago Brito, this issue is related to recent code changes submitted for https://storyboard.openstack.org/#!/story/2009702
The changes appear to be in the stx master branch only, so they should not affect r/stx.6.0.

tags: added: stx.7.0
Changed in starlingx:
importance: Undecided → High
status: New → Triaged
assignee: nobody → Thiago Paiva Brito (outbrito)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/825398

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/c/starlingx/stx-puppet/+/825398
Committed: https://opendev.org/starlingx/stx-puppet/commit/e00edd94f32f09f07e45b0ee3752d097d3a8f844
Submitter: "Zuul (22348)"
Branch: master

commit e00edd94f32f09f07e45b0ee3752d097d3a8f844
Author: Thiago Brito <email address hidden>
Date: Wed Jan 19 17:26:16 2022 -0300

    Fix resource lookup for ovs

    On change ae635b5b80fcb61c429a6fc17961a9f3bf614964, the vswitch_class
    was changed to ovs_dpdk, but the resources created by sysinv at [1]
    live under the platform::vswitch::ovs:: lookup path. This mismatch
    makes the lookup fail, so the bridges for the underlying datanetworks
    aren't created when puppet runs. As a result, the neutron-ovs-agent
    pods fail with CrashLoopBackOff. This commit fixes it by reverting the
    resources to the correct lookup path in hiera.

    [1] https://opendev.org/starlingx/config/src/commit/ece13f740847f3bcc7470cc7ec8c1896dd61f014/sysinv/sysinv/sysinv/sysinv/puppet/ovs.py#L108

    TEST PLAN
    PASS ovs-dpdk: Clean install of the Starlingx ISO verified that the
         br-phy* bridges were created for the underlying datanetworks
         using ovs-vsctl on the host
    PASS ovs-dpdk: Installation of stx-openstack is successful
    PASS ovs-dpdk: Created project networks and instances
    PASS ovs: Clean install of the Starlingx ISO
    PASS ovs: Installation of stx-openstack is successful and the
         openvswitchd and ovs-agent pods are running OK
    PASS ovs: Created project networks and instances

    Closes-Bug: #1958073
    Signed-off-by: Thiago Brito <email address hidden>
    Change-Id: I53e16df5403fa7c7f82b8e67e3e5d18a2103d599
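The commit message above describes a key-prefix mismatch: sysinv writes the vswitch resources into hieradata under platform::vswitch::ovs::, while the manifest was looking them up under the ovs_dpdk path after the vswitch_class change. Purely as an illustration of that failure mode (the key names and values below are hypothetical, not the actual hieradata generated by ovs.py), a small Python sketch:

    # Hypothetical illustration of the hiera lookup-path mismatch; the key
    # names and values are made up, the real hieradata comes from
    # sysinv/puppet/ovs.py.
    hieradata = {
        # sysinv writes vswitch resources under the ovs:: prefix
        "platform::vswitch::ovs::bridges": ["br-phy0"],
    }

    def lookup(key, default=None):
        # stand-in for a hiera lookup that falls back to an empty default
        return hieradata.get(key, default)

    # Before the fix: the manifest queried the ovs_dpdk:: path, got nothing,
    # so no br-phy* bridges were created and neutron-ovs-agent crash-looped.
    broken = lookup("platform::vswitch::ovs_dpdk::bridges", default=[])  # -> []

    # After the fix: the lookup path matches what sysinv actually wrote.
    fixed = lookup("platform::vswitch::ovs::bridges", default=[])        # -> ["br-phy0"]

    assert broken == [] and fixed == ["br-phy0"]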

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Ghada Khalil (gkhalil) wrote :

The stx master sanity is now green after this commit went in. Sanity report: http://lists.starlingx.io/pipermail/starlingx-discuss/2022-January/012698.html

Revision history for this message
Alexandru Dimofte (adimofte) wrote :

I think this issue has been fixed, since the last 2 master builds are green. We can close this issue. Thanks!
