helm cmd failed after host swact

Bug #1875891 reported by Yang Liu
This bug affects 2 people
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Bob Church

Bug Description

Brief Description
-----------------
After a few host swacts, the helm command failed with the following error.

[sysadmin@controller-1 ~(keystone_admin)]$ helm ls
Error: forwarding ports: error upgrading connection: unable to upgrade connection: error dialing backend: dial tcp [2620:10a:a001:a103::11]:40870: connect: connection timed out
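
For triage, the port in the error (40870) can be checked directly on the standby controller; a minimal sketch, assuming sudo access on that node (the netstat output in the analysis below shows the listener is containerd, bound to the OAM floating address):

# Find which process owns the port the port-forward is trying to dial.
sudo netstat -tulpn6 | grep 40870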

Severity
--------
Major

Steps to Reproduce
------------------
(Not sure if the sysadmin password change plays a role here, but these steps were performed before the issue was seen; see the command sketch after this list.)
- change the sysadmin password
- swact
- revert the sysadmin password (multiple changes are needed to revert it due to password-rule restrictions)
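
A sketch of the corresponding commands, assuming a standard StarlingX CLI session (the swact command matches the one logged under Timestamp/Logs; the passwd steps are the regular Linux flow and depend on the password rules in effect):

passwd                              # change the sysadmin password
source /etc/platform/openrc
system host-swact controller-0      # swact to the standby controller
passwd                              # repeat as needed to revert the password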

TC-name: test_sysadmin_aging_and_swact[swact]

Expected Behavior
------------------
- system works as expected

Actual Behavior
----------------
- the helm ls command hung for a long time and eventually returned an error.

Reproducibility
---------------
Intermittent

System Configuration
--------------------
AIO-DX
Lab-name: r430-3-4, wcp-78-79

Branch/Pull Time/Commit
-----------------------
stx master as of 2020-04-27

Last Pass
---------
Lab: WCP_7_10
Load: 2020-04-17_10-33-46

This is an intermittent issue, so the last pass may not be accurate.

Timestamp/Logs
--------------
[2020-04-29 02:48:19,046] 460 INFO MainThread test_linux_user_password_aging.execute_cmd:: Sending cmd:source /etc/platform/openrc; system host-swact controller-0

Test Activity
-------------
Regression Testing

Frank Miller (sensfan22)
Changed in starlingx:
assignee: nobody → Bob Church (rchurch)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/724384

Changed in starlingx:
status: New → In Progress
Ghada Khalil (gkhalil)
tags: added: stx.4.0 stx.containers
Changed in starlingx:
importance: Undecided → High
Bob Church (rchurch) wrote :

Analysis:
- [1] was done to ensure that fixed IPs are used for the cluster and mgmt
  networks when multiple (floating) IPs are present in an IPv6 config.
  - We don't do this for the OAM network, so after controller-0 unlocks we end
    up using the OAM floating address instead of the fixed host address when
    containerd starts.
  - When controller-1 is unlocked, it is the standby controller with only one
    OAM-specific host address, so containerd starts successfully with the
    correct IP.
- The containerd template does not set the stream_server_address [2]
  (see the config check sketched below).
  - Should we be providing an address here that aligns with the address we
    want containerd to listen on (i.e. the OAM host-specific address)?
  - Do we want to use the OAM address here? Should this be the MGMT
    host-specific address or loopback?
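
A quick way to confirm what the rendered template produced on a node (a sketch; /etc/containerd/config.toml is assumed to be where the template in [2] is rendered):

# If stream_server_address is empty or unset, containerd falls back to IP
# discovery at startup and can end up bound to the OAM floating address.
grep stream_server_address /etc/containerd/config.toml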

I suspect that if you restarted containerd on controller-0 (currently standby)
it would get the correct address and things would work correctly.

We should specify an explicit address for containerd to bind to, to avoid IP
discovery when the containerd process starts up.
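
A minimal workaround sketch along those lines, assuming containerd is managed by systemd on the node:

# Restart containerd on the affected (standby) controller so it re-binds,
# then confirm the listener is no longer on the OAM floating address.
sudo systemctl restart containerd
sudo netstat -tulpn6 | grep containerd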

[1] https://review.opendev.org/#/c/715120/
[2] https://opendev.org/starlingx/stx-puppet/src/branch/master/puppet-manifests/src/modules/platform/templates/config.toml.erb#L29

Looking at the lab and related logs:

# Ansible sets up the following:
external_oam_subnet: 2620:10a:a001:a103::6:0/64
external_oam_gateway_address: 2620:10a:a001:a103::6:0
external_oam_floating_address: 2620:10A:A001:A103::11
external_oam_node_0_address: 2620:10A:A001:A103::8
external_oam_node_1_address: 2620:10A:A001:A103::9

# Controller-0 (standby) after ansible, unlocked, and swact to controller-1:
# Problem: Containerd is listening on the OAM controller address -> should be the node address 2620:10A:A001:A103::8
controller-0:~$ sudo netstat -tulpn6 | grep contain
tcp6 0 0 2620:10a:a001:a10:40870 :::* LISTEN 89515/containerd

controller-0:~$ sudo netstat -tulp6 | grep contain
tcp6 0 0 oamcontroller:40870 [::]:* LISTEN 89515/containerd

controller-0:~$ cat /etc/resolv.conf
nameserver face::1
nameserver 2620:10a:a001:a103::2

controller-0:~$ cat /etc/hosts | grep oamcontroller
2620:10a:a001:a103::11 oamcontroller

controller-0:~$ ip a
3: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 18:66:da:af:b5:88 brd ff:ff:ff:ff:ff:ff
    inet6 2620:10a:a001:a103::8/64 scope global <------- containerd should be listening here.
       valid_lft forever preferred_lft forever
    inet6 fe80::1a66:daff:feaf:b588/64 scope link
       valid_lft forever preferred_lft forever
14: vlan157@pxeboot0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc htb state UP group default qlen 1000
    link/ether 3c:fd:fe:25:d5:c0 brd ff:ff:ff:ff:ff:ff
    inet6 face::2/64 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::3efd:feff:fe25:d5c0/64 scope link
       valid_lft forever preferred_lft forever
15: vlan158@pxeboot0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 3c:fd:fe:25:d5:c0 brd ...


Peng Peng (ppeng) wrote :

Issue was reproduced on
Lab: WCP_78_79
Load: 2020-04-29_20-00-00

https://files.starlingx.kube.cengn.ca/launchpad/1875891

OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)

Fix proposed to branch: master
Review: https://review.opendev.org/725394

OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/725394
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=0dc9e173855792c38bec90360c0c4c066c36d66b
Submitter: Zuul
Branch: master

commit 0dc9e173855792c38bec90360c0c4c066c36d66b
Author: Robert Church <email address hidden>
Date: Mon May 4 12:59:49 2020 -0400

    Ensure containerd binds to the loopback interface

    Set the stream_server_address to bind to the loopback interface with a
    value of "127.0.0.1" for IPv4 and "::1" for IPv6.

    This will explicitly update the containerd configuration to use the IP
    address of the loopback interface based on the system's network
    configuration.

    Change-Id: I76a4ad1c123b8b701cb1fa74b16609b50cdf9bd2
    Partial-Bug: #1875891
    Signed-off-by: Robert Church <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/724384
Committed: https://git.openstack.org/cgit/starlingx/stx-puppet/commit/?id=b793518f65ae932f3974ff85b797f505b5ef1c2a
Submitter: Zuul
Branch: master

commit b793518f65ae932f3974ff85b797f505b5ef1c2a
Author: Robert Church <email address hidden>
Date: Wed Apr 29 12:49:04 2020 -0400

    Ensure containerd binds to the loopback interface

    Set the stream_server_address to bind to the loopback interface with a
    value of "127.0.0.1" for IPv4 and "::1" for IPv6.

    Without setting the stream_server_address in config.toml, containerd was
    binding to the OAM interface. Under most situations this resulted in
    containerd binding to the OAM fixed host address. But in an IPv6
    configuration there were occasions where after controller-0 unlock, the
    OAM floating IP would be used. When this happened, swacting away from
    controller-0 would move the OAM floating IP to controller-1 and break
    access to containers residing on controller-0.

    This will explicitly update the containerd configuration to use the IP
    address of the loopback interface based on the system's network
    configuration.

    This also removes any security concerns with containerd binding to the
    OAM interface.

    Change-Id: I0f914d738e94b525cf217712675d3b4575817d1d
    Depends-On: https://review.opendev.org/#/c/725394/
    Closes-Bug: #1875891
    Signed-off-by: Robert Church <email address hidden>
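
For reference, a verification sketch for a node running this fix (the rendered config path and sudo access are assumptions; expected values per the commit message above):

grep stream_server_address /etc/containerd/config.toml   # expect "::1" (IPv6) or "127.0.0.1" (IPv4)
sudo netstat -tulpn6 | grep containerd                   # the stream server should now listen on loopback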

Yang Liu (yliu12) wrote :

This issue is no longer seen in sanity on 3 different systems with the 2020-05-05_20-29-49 load.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/729809

OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/729825

OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (f/centos8)

Reviewed: https://review.opendev.org/729825
Committed: https://git.openstack.org/cgit/starlingx/stx-puppet/commit/?id=d4617fbad74a05f2af81ee85a47565083991e6f8
Submitter: Zuul
Branch: f/centos8

commit 4134023ab84d8a635b118d5e3ff26ade3bbe535b
Author: Sharath Kumar K <email address hidden>
Date: Thu May 7 10:08:11 2020 +0200

    Tox and Zuul job for the bandit code scan in stx/stx-puppet

    Setting up the bandit tool for the scanning of HIGH severity issues
    in the python codes under Starlingx/stx-puppet folder.
    Expecting this merge will enable zuul job for CI/CD of bandit scan.

    Configuration files:
    1. tox.ini for adding bandit environment and command.
    2. test-requirements.txt for adding bandit version.
    3. .zuul.yaml file for adding bandit job and configuring under
       check job to run code scan every time before code commit.

    Test:
    Run tox -e bandit command inside the fault folder to validate the
    bandit scan and result.

    Story: 2007541
    Task: 39687
    Depends-On: https://review.opendev.org/#/c/721294/

    Change-Id: I2982268db2b5e75feeb287bc95420fedc9b0d816
    Signed-off-by: Sharath Kumar K <email address hidden>

commit 65daac29e4635f32a57e80cd18f96fd59dc8ebe0
Author: Bin Qian <email address hidden>
Date: Tue May 12 22:39:21 2020 -0400

    DC cert manifest should only apply to controller nodes

    DC cert manifest should only apply to controller nodes on system
    controller.
    This fix is for DC with worker nodes in central cloud.

    Change-Id: I4233509a6f0afb3013c01e81dea6f655d9e15371
    Closes-Bug: 1878260
    Signed-off-by: Bin Qian <email address hidden>

commit 04a3cb8cbad9b1700286c5de67aa5d974cf54400
Author: Elena Taivan <email address hidden>
Date: Wed Apr 29 08:44:13 2020 +0000

    Changing permissions for conversion folder

    Adding writing permissions to '/opt/conversion' mountpoint
    so openstack image conversion can happen there.

    Change-Id: Id1a91db6570dcbed3b8068e79e72f5bb800f24ad
    Partial-bug: 1819688
    Signed-off-by: Elena Taivan <email address hidden>

commit 4e9153cf234e714e4bbc9a9eb3d9b55b2828145a
Author: Tao Liu <email address hidden>
Date: Mon May 4 14:30:30 2020 -0500

    Move subcloud audit to separate process

    Subcloud audit is being removed from the dcmanager-manager
    process and it is running in dcmanager-audit process.

    This update adds associated puppet config.

    Story: 2007267
    Task: 39640
    Depends-On: https://review.opendev.org/#/c/725627/

    Change-Id: Idd2e675126a01d6113597646ddd9eb4a0bc5be44
    Signed-off-by: Tao Liu <email address hidden>

commit b793518f65ae932f3974ff85b797f505b5ef1c2a
Author: Robert Church <email address hidden>
Date: Wed Apr 29 12:49:04 2020 -0400

    Ensure containerd binds to the loopback interface

    Set the stream_server_address to bind to the loopback interface with a
    value of "127.0.0.1" for IPv4 and "::1" for IPv6.

    Without setting the stream_server_address in config.toml, containerd was
    binding to the OAM interface. Under most situations this resulted in
    containe...

tags: added: in-f-centos8
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (f/centos8)

Reviewed: https://review.opendev.org/729809
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=73027425d4501a6b7785e91024c9e8ddbc03115d
Submitter: Zuul
Branch: f/centos8

commit 55c9afd075194f7669fa2a87e546f61034679b04
Author: Dan Voiculeasa <email address hidden>
Date: Wed May 13 14:19:52 2020 +0300

    Restore: disconnect etcd from ceph

    At the moment etcd is restored only if ceph data is kept.
    Etcd should be restored regardless if ceph data is kept or wiped.

    Story: 2006770
    Task 39751
    Change-Id: I9dfb1be0a83c3fdc5f1b29cbb974c5e0e2236ad3
    Signed-off-by: Dan Voiculeasa <email address hidden>

commit 003ddff574c74adf11cf8e4758e93ba0eed45a6a
Author: Don Penney <email address hidden>
Date: Fri May 8 11:35:58 2020 -0400

    Add playbook for updating static images

    This commit introduces a new playbook, upgrade-static-images.yml, used
    for downloading updating images and pushing to the local registry.

    Change-Id: I8884440261a5a4e27b40398e5a75c9d03b09d4ba
    Story: 2006781
    Task: 39706
    Signed-off-by: Don Penney <email address hidden>

commit 26fd273cf5175ba4bdd31d6b6b777814f1a6c860
Author: Matt Peters <email address hidden>
Date: Thu May 7 14:29:02 2020 -0500

    Add kube-apiserver port to calico failsafe rules

    An invalid GlobalNetworkPolicy or NetworkPolicy may prevent
    calico-node from communicating with the kube-apiserver.
    Once the communication is broken, calico-node is no longer
    able to update the policies since it cannot communicate to
    read the updated policies. It can also prevent the pod
    from starting since the policies will prevent it from
    reading the configuration.

    To ensure that this scenario does not happen, the kube-apiserver
    port is being added to the failsafe rules to ensure communication
    is always possible, regardless of the network policy configuration.

    Change-Id: I1b065a74e7ad0ba9b1fdba4b63136b97efbe98ce
    Closes-Bug: 1877166
    Related-Bug: 1877383
    Signed-off-by: Matt Peters <email address hidden>

commit bd0f14a7dfb206ccaa3ce0f5e7d9034703b3403c
Author: Robert Church <email address hidden>
Date: Tue May 5 15:11:15 2020 -0400

    Provide an update strategy for Tiller deployment

    In the case of a simplex controller configuration the current patching
    strategy for the Tiller environment will fail as the tiller ports will
    be in use when the new deployment is attempted to be applied. The
    resulting tiller pod will be stuck in a Pending state.

    This will be observed if the node becomes ready after 'helm init'
    installs the initial deployment and before the deployment is patched for
    environment checks.

    The deployment strategy provided by 'helm init' is unspecified. This
    change will allow one additional pod (current + new) and one unavailable
    pod (current) during an update. The maxUnavailable setting allows the
    tiller pod to be deleted which will release its ports, thus allowing the
    patch deployment to spin up an new pod to a Running state.

    Change-Id: I83c43c52a77...
