platform-integ-apps apply failed - tiller on Crash

Bug #1851533 reported by Cristopher Lemus
Affects: StarlingX
Status: Fix Released
Importance: Critical
Assigned to: Bob Church

Bug Description

Brief Description
-----------------
During the initial setup of a Standard with Dedicated Storage configuration, the platform-integ-apps application went to apply-failed status.

Severity
--------
Critical.

Steps to Reproduce
------------------
Follow the StarlingX documentation to set up a Standard with Dedicated Storage configuration.

Expected Behavior
------------------
platform-integ-apps reaches applied status.

Actual Behavior
----------------
platform-integ-apps ends up in apply-failed status.

Reproducibility
---------------
Updated to 50% reproducible.

System Configuration
--------------------
Baremetal: Standard with Dedicated Storage (2+2+2) and Simplex
Virtual: Standard with Dedicated Storage (2+2+2)

Branch/Pull Time/Commit
-----------------------
BUILD_ID="20191106T023000Z"

Last Pass
---------
This stage passed one build before: 20191105T000000Z

Timestamp/Logs
--------------
Full collect attached.

Some outputs that point to tiller-deploy pod: http://paste.openstack.org/show/785858/

Test Activity
-------------
Sanity

Cristopher Lemus (cjlemusc) wrote :

With BUILD_ID="20191107T023000Z", this issue reproduced on the Virtual environment only; on baremetal it did not reproduce. Lowering the reproducibility to 50%. Uploading full collect.

NAMESPACE     NAME                           READY   STATUS             RESTARTS   AGE
kube-system   tiller-deploy-d6b59fcb-5xpb9   0/1     CrashLoopBackOff   101        8h

description: updated
Ghada Khalil (gkhalil)
tags: added: stx.containers
Cristopher Lemus (cjlemusc) wrote :

This behavior has reproduced again, now on a Simplex (all-in-one) configuration.

BUILD_ID="20191111T000000Z"

NAMESPACE     NAME                           READY   STATUS             RESTARTS   AGE
kube-system   tiller-deploy-d6b59fcb-ttx8n   0/1     CrashLoopBackOff   79         6h37m

Full collect attached.

Bruce Jones (brucej) wrote :

Setting this to Critical since it's blocking Sanity. Asking Cindy to have someone take a look asap.

Changed in starlingx:
importance: Undecided → Critical
assignee: nobody → Cindy Xie (xxie1)
Cristopher Lemus (cjlemusc) wrote :

Just to clarify: with BUILD_ID="20191111T000000Z", it reproduced on a Simplex Baremetal configuration.

description: updated
marvin Yu (marvin-yu)
Changed in starlingx:
status: New → Incomplete
status: Incomplete → New
Bob Church (rchurch) wrote :

It looks like tiller fails because 'helm init' hard-codes the tiller gRPC port to 44134 and the HTTP port to 44135. In the logs from controller-0_20191111.095417.tar, port 44134 is already in use as the local port of the connection between ceph-mon and ceph-mgr.

Tiller complains:
var/log/pods/kube-system_tiller-deploy-d6b59fcb-ttx8n_b861e13f-6a4d-4e58-8d64-1939fa71bec4/tiller/80.log:{"log":"[main] 2019/11/11 09:54:12 Server died: listen tcp :44134: bind: address already in use\n","stream":"stderr","time":"2019-11-11T09:54:12.883393176Z"}

Established connection, and port marked unreachable:
var/log/sm-troubleshoot.log:tcp 0 0 192.168.204.2:44134 192.168.204.2:6789 ESTABLISHED 109328/ceph-mgr off (0.00/0/0)
var/log/sm-troubleshoot.log:tcp 0 0 192.168.204.2:6789 192.168.204.2:44134 ESTABLISHED 98354/ceph-mon off (0.00/0/0)
var/extra/iptables.dump:-A KUBE-SERVICES -d 10.100.253.239/32 -p tcp -m comment --comment "kube-system/tiller-deploy:tiller has no endpoints" -m tcp --dport 44134 -j REJECT --reject-with icmp-port-unreachable

And in ALL_NODES_20191107.152052.tar, the port is in use by the connection between kube-apiserver and etcd.

var/log/pods/kube-system_tiller-deploy-d6b59fcb-5xpb9_efbe60e4-5ad6-49ce-9056-94d859321d93/tiller/102.log:{"log":"[main] 2019/11/07 15:21:18 Server died: listen tcp :44134: bind: address already in use\n","stream":"stderr","time":"2019-11-07T15:21:18.920805438Z"}

var/log/sm-troubleshoot.log:tcp 0 0 192.168.206.2:44134 192.168.206.1:2379 ESTABLISHED 92628/kube-apiserve keepalive (9.00/0/0)
var/log/sm-troubleshoot.log:tcp6 0 0 192.168.206.1:2379 192.168.206.2:44134 ESTABLISHED 94848/etcd keepalive (9.06/0/0)
var/extra/iptables.dump:-A KUBE-SERVICES -d 10.110.249.81/32 -p tcp -m comment --comment "kube-system/tiller-deploy:tiller has no endpoints" -m tcp --dport 44134 -j REJECT --reject-with icmp-port-unreachable

We are going to need to reserve these ports for tiller, or see if we can set the tiller ports to non-ephemeral port numbers.
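
The failure mode described in this comment can be sketched in a few lines of Python. This is only an illustration under assumptions, not the tiller or ceph code: the function name is hypothetical, the loopback server stands in for ceph-mon or etcd, the client stands in for ceph-mgr or kube-apiserver, and the final wildcard bind stands in for tiller listening on its fixed port. Socket options that a real server might set (such as SO_REUSEADDR) are left out for simplicity.

```python
import errno
import socket


def bind_fails_when_port_is_client_side(host: str = "127.0.0.1") -> bool:
    """Return True if a wildcard bind fails because an outgoing
    connection already holds the same local port."""
    # Stand-in for the remote service (ceph-mon or etcd in the logs).
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((host, 0))
    server.listen(1)

    # Stand-in for the client: the kernel picks an ephemeral local port
    # for the outgoing connection.
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect(server.getsockname())
    local_port = client.getsockname()[1]

    # Stand-in for tiller: try to bind that same port on all addresses.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        listener.bind(("", local_port))
    except OSError as exc:
        return exc.errno == errno.EADDRINUSE
    else:
        return False
    finally:
        listener.close()
        client.close()
        server.close()


if __name__ == "__main__":
    print(bind_fails_when_port_is_client_side())
```

In the reported logs the ephemeral port that happened to be picked was 44134, which is why the collision only shows up on some installs.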

Austin Sun (sunausti) wrote :

If https://review.opendev.org/#/c/691714 and https://review.opendev.org/#/c/692439 are reverted, this issue is gone.
The owner of those changes needs to double-check this issue.

Cindy Xie (xxie1)
Changed in starlingx:
assignee: Cindy Xie (xxie1) → zhipeng liu (zhipengs)
Ghada Khalil (gkhalil) wrote :

Subscribed Bin Qian to this LP since he's the author of the commits mentioned above.

Bin Qian (bqian20) wrote :

The commit https://opendev.org/starlingx/config-files/commit/fbc09b8db8a14fbf24976a0f7d8924af7a330f85 removed the following settings:
# Limit local port range
net.ipv4.ip_local_port_range = 49216 61000
net.ipv4.tcp_tw_reuse = 1
This causes the local port range to fall back to the default of 32768 to 60999, in which case the tiller port 44134 can be randomly assigned as a client port.
The commit in question should be reviewed.
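
As a quick check of the reasoning in this comment, the two ranges can be compared against the tiller port. This is a minimal sketch; the helper name is hypothetical, and the numbers are taken from the comment and the tiller log lines above.

```python
def in_ephemeral_range(port: int, low: int, high: int) -> bool:
    """True if the kernel may hand out `port` as a local (client) port."""
    return low <= port <= high


TILLER_GRPC_PORT = 44134  # port number from the tiller log lines

# Range configured by the removed sysctl lines.
print(in_ephemeral_range(TILLER_GRPC_PORT, 49216, 61000))  # False
# Default range quoted in this comment.
print(in_ephemeral_range(TILLER_GRPC_PORT, 32768, 60999))  # True
```

With the old range, 44134 could never be picked for an outgoing connection; with the default range it can.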

yong hu (yhu6) wrote :

@Austin is investigating this issue.

Changed in starlingx:
assignee: zhipeng liu (zhipengs) → Austin Sun (sunausti)
Austin Sun (sunausti) wrote :

The commit fbc09b8db8a14fbf24976a0f7d8924af7a330f85 was merged two months ago.
LP #1851533 has been reported since 11/05 and was never reported before that.

The GDC team tested three times with https://review.opendev.org/#/c/691714 and https://review.opendev.org/#/c/692439 reverted, and this issue did not reproduce.

Frank Miller (sensfan22) wrote :

Bob investigated the port issue; it is discussed here: https://github.com/helm/helm/issues/5564
When helm is upgraded to helm 3, these ports will no longer be required. But for stx.3.0 a temporary solution is needed.

Re-assigning to Bob to implement a solution where the tiller ports are not ephemeral.

Changed in starlingx:
assignee: Austin Sun (sunausti) → Bob Church (rchurch)
Ghada Khalil (gkhalil)
Changed in starlingx:
status: New → Triaged
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/694355

Changed in starlingx:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/694355
Committed: https://git.openstack.org/cgit/starlingx/stx-puppet/commit/?id=a713f9567d212142ec1d7f69c1f4d126a8d5475c
Submitter: Zuul
Branch: master

commit a713f9567d212142ec1d7f69c1f4d126a8d5475c
Author: Robert Church <email address hidden>
Date: Thu Nov 14 09:04:42 2019 -0500

    Reserve ports in the ephemeral port range

    Set ip_local_reserved_ports for keystone and tiller

    Per https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt,
    this variable will:
    - Specify the ports which are reserved for known third-party
      applications.
    - Note that ip_local_port_range and ip_local_reserved_ports settings are
      independent and both are considered by the kernel when determining
      which ports are available for automatic port assignments.

    This results in the following on controllers:

    $ cat /proc/sys/net/ipv4/ip_local_port_range
    32768 60999

    $ cat /proc/sys/net/ipv4/ip_local_reserved_ports
    35357,44134-44136

    Change-Id: I59219dc1e6b834e105be55e1e863b8f82fe50816
    Closes-Bug: #1851533
    Signed-off-by: Robert Church <email address hidden>
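
The effect of the two sysctl values quoted in the commit message can be sketched as follows. This is an illustration only, with hypothetical helper names: it parses the comma-and-range format shown above and treats a port as available for automatic assignment when it is inside ip_local_port_range and not listed in ip_local_reserved_ports, which is how the quoted kernel documentation describes the two settings being combined.

```python
from typing import Set, Tuple


def parse_reserved_ports(spec: str) -> Set[int]:
    """Parse a list such as '35357,44134-44136' into a set of ports."""
    ports: Set[int] = set()
    for item in spec.strip().split(","):
        if not item:
            continue
        first, _, last = item.partition("-")
        start = int(first)
        end = int(last) if last else start
        ports.update(range(start, end + 1))
    return ports


def auto_assignable(port: int, port_range: Tuple[int, int], reserved: Set[int]) -> bool:
    """True if `port` may be picked automatically as a local port."""
    low, high = port_range
    return low <= port <= high and port not in reserved


if __name__ == "__main__":
    reserved = parse_reserved_ports("35357,44134-44136")
    local_range = (32768, 60999)
    for port in (35357, 44134, 44135, 44137):
        print(port, auto_assignable(port, local_range, reserved))
```

With these values, 44134 and 44135 stay inside the ephemeral range but are no longer handed out to outgoing connections, so tiller can bind them.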

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to config-files (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/715095

OpenStack Infra (hudson-openstack) wrote : Related fix merged to config-files (master)

Reviewed: https://review.opendev.org/715095
Committed: https://git.openstack.org/cgit/starlingx/config-files/commit/?id=de8d65efdf298d23ad690fb0b97d209cc95e9354
Submitter: Zuul
Branch: master

commit de8d65efdf298d23ad690fb0b97d209cc95e9354
Author: Robert Church <email address hidden>
Date: Wed Mar 25 17:19:57 2020 -0400

    Reserve ephemeral ports that are expected by system services

    Update sysctl.conf to reserve keystone and tiller ports so that any
    initial system processes do not claim these ports.

    These are also reserved in puppet and part of initial system
    provisioning.

    Change-Id: I3bae661348718df00f7b50ba15931281a744d473
    Closes-Bug: #1869011
    Related-Bug: #1851533
    Signed-off-by: Robert Church <email address hidden>

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to config-files (f/centos8)

Related fix proposed to branch: f/centos8
Review: https://review.opendev.org/716138

OpenStack Infra (hudson-openstack) wrote : Related fix merged to config-files (f/centos8)

Reviewed: https://review.opendev.org/716138
Committed: https://git.openstack.org/cgit/starlingx/config-files/commit/?id=77460a9893ddbec82cf2a370e2434d5970b556f9
Submitter: Zuul
Branch: f/centos8

commit de8d65efdf298d23ad690fb0b97d209cc95e9354
Author: Robert Church <email address hidden>
Date: Wed Mar 25 17:19:57 2020 -0400

    Reserve ephemeral ports that are expected by system services

    Update sysctl.conf to reserve keystone and tiller ports so that any
    initial system processes do not claim these ports.

    These are also reserved in puppet and part of initial system
    provisioning.

    Change-Id: I3bae661348718df00f7b50ba15931281a744d473
    Closes-Bug: #1869011
    Related-Bug: #1851533
    Signed-off-by: Robert Church <email address hidden>

commit b95127d6800612776adbb4307bc97a7a14105762
Author: Jessica Castelino <email address hidden>
Date: Fri Mar 6 16:27:28 2020 -0500

    Log rotation for Distributed Cloud

    Implemented log rotation for dcdbsync.log and increased the size of
    dcorch.log to 20M

    Change-Id: I29f701fa0d4701820f6409a08478bf2d84e4dc10
    Story: 2007267
    Task: 38978
    Partial-Bug: 1857069
    Signed-off-by: Jessica Castelino <email address hidden>

commit aecd17c5e3e928d84c7ac14f247bab2fbee5b6d5
Author: Bin Qian <email address hidden>
Date: Wed Feb 5 14:17:37 2020 -0500

    Adding job to upload commits to GitHub

    Add job to publish config-files repo to GitHub

    Change-Id: I5e08200ed748e080f2629ac5c1af05d8fddbb497
    Story: 2007252
    Task: 38665
    Signed-off-by: Bin Qian <email address hidden>

tags: added: in-f-centos8
Ghada Khalil (gkhalil) wrote :

Bob/Frank, This LP is marked as gating for stx.3.0. Please cherry-pick the code changes to the stx.3.0 branch if applicable or add a note explaining why it shouldn't be cherry-picked.
