helm overrides fail due to calico networking issue

Bug #1877166 reported by Nimalini Rasa
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Matt Peters
Milestone: none

Bug Description

Brief Description
-----------------
helm-override-update failed for oidc app in DC subcloud.

Severity
--------

Major

Steps to Reproduce
------------------
Create a Helm override for the oidc app and apply it:
system helm-override-update --values /home/sysadmin/ssl/dex-overrides.yaml oidc-auth-apps dex kube-system

Expected Behavior
------------------
Helm override to be applied successfully

Actual Behavior
----------------
The command failed with:
Timeout while waiting on RPC response - topic: "sysinv.conductor_manager", RPC method: "merge_overrides" info: "<unknown>"

Reproducibility
---------------
Intermittent

System Configuration
--------------------
Single-node system, IPv6, DC subcloud

Branch/Pull Time/Commit
-----------------------
2020-05-05

Last Pass
---------
N/A

Timestamp/Logs
--------------
2020-05-06T14:34:25.000 (cmd issued)

sysinv 2020-05-06 14:34:35.159 95968 ERROR wsme.api [-] Server-side error: "Timeout while waiting on RPC response - topic: "sysinv.conductor_manager", RPC method: "merge_overrides" info: "<unknown>"". Detail:
Traceback (most recent call last):

  File "/usr/lib/python2.7/site-packages/wsmeext/pecan.py", line 85, in callfunction
    result = f(self, *args, **kwargs)

  File "/usr/lib64/python2.7/site-packages/sysinv/api/controllers/v1/helm_charts.py", line 202, in patch
    set_overrides=set_overrides)

  File "/usr/lib64/python2.7/site-packages/sysinv/conductor/rpcapi.py", line 1677, in merge_overrides
    set_overrides=set_overrides))

  File "/usr/lib64/python2.7/site-packages/sysinv/openstack/common/rpc/proxy.py", line 126, in call
    exc.info, real_topic, msg.get('method'))

Timeout: Timeout while waiting on RPC response - topic: "sysinv.conductor_manager", RPC method: "merge_overrides" info: "<unknown>"

Test Activity
-------------
System Test

Revision history for this message
Nimalini Rasa (nrasa) wrote :

helm ls
Error: Get https://[fd04::1]:443/api/v1/namespaces/kube-system/configmaps?labelSelector=OWNER%!D(MISSING)TILLER: dial tcp [fd04::1]:443: i/o timeout
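
A quick way to tell a plain connectivity failure apart from a Helm or Tiller problem is to probe the same address and port directly. The following is only a minimal sketch: the helper name `can_reach` and the 5-second default timeout are assumptions for illustration and are not part of the original report.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Any socket-level failure (timeout, refused connection, unreachable
    network, name resolution error) is reported as False.
    """
    try:
        # create_connection resolves the host first, so IPv6 literals
        # such as "fd04::1" work without extra handling.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `can_reach("fd04::1", 443)` on the affected controller would be expected to return False while `helm ls` is timing out, which points at the network path rather than at Helm itself.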

Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Bob Church (rchurch)
tags: added: stx.containers
Revision history for this message
Matt Peters (mpeters-wrs) wrote :

After further investigation, the networking issue is actually caused by the following LP:
https://bugs.launchpad.net/starlingx/+bug/1877383

The incorrect endpoint configuration forces kube-apiserver access to go through the OAM network on port 6443. That access is permitted only by our current GlobalNetworkPolicy, which enables TCP egress traffic and TCP ingress on port 6443. As a result, while calico-node is applying the iptables rules, there is a window between the time it has set up the ingress and egress host endpoint rules (which include the DROP rules) and the time it configures the policy-specific rules. During that window, traffic to the kube-apiserver is dropped.

To protect against this condition, we can update the Calico failsafe rules to include the kube-apiserver port 6443. This ensures that calico-node always has access to K8s to populate the iptables rules, even if there is a misconfiguration.

# Configure inbound failsafe rules
- name: FELIX_FAILSAFEINBOUNDHOSTPORTS
  value: "tcp:22, udp:68, tcp:179, tcp:6443"
# Configure outbound failsafe rules
- name: FELIX_FAILSAFEOUTBOUNDHOSTPORTS
  value: "udp:53, udp:67, tcp:179, tcp:6443"
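
The effect of the extra failsafe entry can be illustrated with a small model of the window described above. This is a deliberately simplified sketch, not Felix's actual rule evaluation: the function names, the `policy_programmed` flag, and the "before" port list (the proposed value with `tcp:6443` removed) are assumptions made for illustration.

```python
from typing import Set, Tuple

Port = Tuple[str, int]  # (protocol, port), for example ("tcp", 6443)


def parse_failsafe_ports(value: str) -> Set[Port]:
    """Parse a value such as "tcp:22, udp:68" into {("tcp", 22), ("udp", 68)}."""
    ports: Set[Port] = set()
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        proto, _, number = item.partition(":")
        ports.add((proto.strip().lower(), int(number)))
    return ports


def verdict(packet: Port, failsafe: Set[Port], policy_allowed: Set[Port],
            policy_programmed: bool) -> str:
    """Simplified host endpoint evaluation: failsafe, then policy, then DROP."""
    if packet in failsafe:
        return "ACCEPT"
    # Policy rules only help once they have actually been programmed.
    if policy_programmed and packet in policy_allowed:
        return "ACCEPT"
    return "DROP"


API = ("tcp", 6443)
policy = {API}  # stands in for the GlobalNetworkPolicy rule allowing 6443

# Hypothetical pre-fix value: the proposed list without tcp:6443.
before = parse_failsafe_ports("tcp:22, udp:68, tcp:179")
# Inbound value proposed in this comment.
after = parse_failsafe_ports("tcp:22, udp:68, tcp:179, tcp:6443")

# During the window, the DROP rules exist but the policy rules do not yet.
print(verdict(API, before, policy, policy_programmed=False))  # DROP
print(verdict(API, after, policy, policy_programmed=False))   # ACCEPT
```

In this model the policy rule alone is enough once it is programmed, but only the failsafe entry keeps port 6443 reachable during the window, which is the gap the proposed change is meant to close.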

Ghada Khalil (gkhalil)
tags: added: stx.4.0
Changed in starlingx:
importance: Undecided → High
status: New → Triaged
Ghada Khalil (gkhalil)
summary: - Helm override update failed for Oidc app
+ helm overrides fail due to calico networking issue
Ghada Khalil (gkhalil)
Changed in starlingx:
status: Triaged → In Progress
assignee: Bob Church (rchurch) → Matt Peters (mpeters-wrs)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)

Fix proposed to branch: master
Review: https://review.opendev.org/726231

Ghada Khalil (gkhalil)
tags: added: stx.networking
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/726231
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=26fd273cf5175ba4bdd31d6b6b777814f1a6c860
Submitter: Zuul
Branch: master

commit 26fd273cf5175ba4bdd31d6b6b777814f1a6c860
Author: Matt Peters <email address hidden>
Date: Thu May 7 14:29:02 2020 -0500

    Add kube-apiserver port to calico failsafe rules

    An invalid GlobalNetworkPolicy or NetworkPolicy may prevent
    calico-node from communicating with the kube-apiserver.
    Once the communication is broken, calico-node is no longer
    able to update the policies since it cannot communicate to
    read the updated policies. It can also prevent the pod
    from starting since the policies will prevent it from
    reading the configuration.

    To ensure that this scenario does not happen, the kube-apiserver
    port is being added to the failsafe rules to ensure communication
    is always possible, regardless of the network policy configuration.

    Change-Id: I1b065a74e7ad0ba9b1fdba4b63136b97efbe98ce
    Closes-Bug: 1877166
    Related-Bug: 1877383
    Signed-off-by: Matt Peters <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/729809

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (f/centos8)

Reviewed: https://review.opendev.org/729809
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=73027425d4501a6b7785e91024c9e8ddbc03115d
Submitter: Zuul
Branch: f/centos8

commit 55c9afd075194f7669fa2a87e546f61034679b04
Author: Dan Voiculeasa <email address hidden>
Date: Wed May 13 14:19:52 2020 +0300

    Restore: disconnect etcd from ceph

    At the moment etcd is restored only if ceph data is kept.
    Etcd should be restored regardless if ceph data is kept or wiped.

    Story: 2006770
    Task 39751
    Change-Id: I9dfb1be0a83c3fdc5f1b29cbb974c5e0e2236ad3
    Signed-off-by: Dan Voiculeasa <email address hidden>

commit 003ddff574c74adf11cf8e4758e93ba0eed45a6a
Author: Don Penney <email address hidden>
Date: Fri May 8 11:35:58 2020 -0400

    Add playbook for updating static images

    This commit introduces a new playbook, upgrade-static-images.yml, used
    for downloading updating images and pushing to the local registry.

    Change-Id: I8884440261a5a4e27b40398e5a75c9d03b09d4ba
    Story: 2006781
    Task: 39706
    Signed-off-by: Don Penney <email address hidden>

commit 26fd273cf5175ba4bdd31d6b6b777814f1a6c860
Author: Matt Peters <email address hidden>
Date: Thu May 7 14:29:02 2020 -0500

    Add kube-apiserver port to calico failsafe rules

    An invalid GlobalNetworkPolicy or NetworkPolicy may prevent
    calico-node from communicating with the kube-apiserver.
    Once the communication is broken, calico-node is no longer
    able to update the policies since it cannot communicate to
    read the updated policies. It can also prevent the pod
    from starting since the policies will prevent it from
    reading the configuration.

    To ensure that this scenario does not happen, the kube-apiserver
    port is being added to the failsafe rules to ensure communication
    is always possible, regardless of the network policy configuration.

    Change-Id: I1b065a74e7ad0ba9b1fdba4b63136b97efbe98ce
    Closes-Bug: 1877166
    Related-Bug: 1877383
    Signed-off-by: Matt Peters <email address hidden>

commit bd0f14a7dfb206ccaa3ce0f5e7d9034703b3403c
Author: Robert Church <email address hidden>
Date: Tue May 5 15:11:15 2020 -0400

    Provide an update strategy for Tiller deployment

    In the case of a simplex controller configuration the current patching
    strategy for the Tiller environment will fail as the tiller ports will
    be in use when the new deployment is attempted to be applied. The
    resulting tiller pod will be stuck in a Pending state.

    This will be observed if the node becomes ready after 'helm init'
    installs the initial deployment and before the deployment is patched for
    environment checks.

    The deployment strategy provided by 'helm init' is unspecified. This
    change will allow one additional pod (current + new) and one unavailable
    pod (current) during an update. The maxUnavailable setting allows the
    tiller pod to be deleted which will release its ports, thus allowing the
    patch deployment to spin up a new pod to a Running state.

    Change-Id: I83c43c52a77...

tags: added: in-f-centos8