[Containers] Pods not running on computes (Standard 2+2)

Bug #1817723 reported by Jose Perez Carranza
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Mingyuan Qi

Bug Description

Title
-----
Some pods are not running correctly on computes in a Standard Non-Storage configuration.

Brief Description
-----------------
After finishing provisioning and applying the application, some pods do not run properly on the compute nodes; they stay in "ContainerCreating", "Init", or "Pending" status.

Severity
--------
Critical

Steps to Reproduce
------------------
1. Complete the installation of a Standard Non-Storage configuration as described at:
   - https://wiki.openstack.org/wiki/StarlingX/Containers/InstallationOnStandard
2. Execute the command below to verify that all pods are Running or Completed (an illustrative polling variant is sketched after the steps):
   $ kubectl get pods --all-namespaces -o wide
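
   (Illustrative sketch only, not part of the original steps: rather than re-running the command by hand, a small polling loop can wait until no pod is outside the Running/Completed states; the 10-minute timeout is an arbitrary assumption.)

   $ for i in $(seq 1 60); do
   >   BAD=$(kubectl get pods --all-namespaces --no-headers | grep -cv -e Running -e Completed)
   >   [ "$BAD" -eq 0 ] && echo "all pods Running/Completed" && break
   >   sleep 10
   > done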

Expected Behavior
------------------
All Pods should be Running or Completed

Actual Behavior
----------------
Some pods are neither Running nor Completed:

controller-0:~$ kubectl get pods --all-namespaces -o wide |grep -v -e "Running" -e "Completed"
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE
kube-system calico-node-4bppq 0/2 ContainerCreating 0 6h39m 192.168.204.173 compute-1 <none>
kube-system calico-node-4gtcs 0/2 ContainerCreating 0 6h44m 192.168.204.245 compute-0 <none>
kube-system kube-proxy-jfjv7 0/1 ContainerCreating 0 6h39m 192.168.204.173 compute-1 <none>
kube-system kube-proxy-tw264 0/1 ContainerCreating 0 6h44m 192.168.204.245 compute-0 <none>
openstack nova-cell-setup-z2fgz 0/1 Init:0/2 0 142m 172.16.0.43 controller-0 <none>
openstack osh-openstack-garbd-garbd-5d495764d9-6xlrd 0/1 Pending 0 153m <none> <none> <none>

Reproducibility
---------------
Reproducible

System Configuration
--------------------
Configuration: Standard Non Storage
Environment: Virtual and BM

Branch/Pull Time/Commit
-----------------------
Branch: Master
ISO: http://mirror.starlingx.cengn.ca/mirror/starlingx/master/centos/20190223T060000Z/outputs/iso/bootimage.iso

Timestamp/Logs
--------------
http://paste.openstack.org/show/746163/

Revision history for this message
Jose Perez Carranza (jgperezc) wrote :
Ada Cabrales (acabrale)
tags: added: stx.containers
Revision history for this message
Cristopher Lemus (cjlemusc) wrote :

Tried using ISO from stein branch (2019-02-25):

[wrsroot@controller-1 ~(keystone_admin)]$ cat /etc/build.info
###
### StarlingX
### Release 19.01
###

OS="centos"
SW_VERSION="19.01"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="f/stein"

JOB="STX_build_stein_master"
BUILD_BY="<email address hidden>"
BUILD_NUMBER="54"
BUILD_HOST="starlingx_mirror"
BUILD_DATE="2019-02-25 19:13:50 +0000"

The same pods are not properly running:

[wrsroot@controller-1 ~(keystone_admin)]$ kubectl get pods --all-namespaces -o wide |egrep -vi "running|completed"
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE
kube-system calico-node-9hfr4 0/2 ContainerCreating 0 163m 10.10.54.100 compute-1 <none>
kube-system calico-node-dqbzh 0/2 ContainerCreating 0 164m 10.10.54.192 compute-0 <none>
kube-system kube-proxy-krmkz 0/1 ContainerCreating 0 164m 10.10.54.192 compute-0 <none>
kube-system kube-proxy-tkxwf 0/1 ContainerCreating 0 163m 10.10.54.100 compute-1 <none>
openstack heat-engine-cleaner-1551185700-7c66t 0/1 PodInitializing 0 5s 172.16.0.67 controller-0 <none>
openstack nova-cell-setup-bl4br 0/1 Init:0/2 0 42m 172.16.1.49 controller-1 <none>
openstack osh-openstack-garbd-garbd-74d7ff4f-t4gsw 0/1 Pending 0 88m <none> <none> <none>

Revision history for this message
Bob Church (rchurch) wrote :

We are not seeing this in a non-proxy environment. Can you gather some more information from these pods?

See some debugging tips: https://wiki.openstack.org/wiki/StarlingX/Containers/FAQ#What_should_I_do_if_I_see_a_pod_is_not_in_a_Running_state

Basically you want to use "kubectl describe" at this point and check the events.

It seems like those pods are having trouble pulling their images or perhaps there is a long delay in getting the docker images for those containers.
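
For example (a generic sketch, not from this report), the events for one of the stuck pods listed above can be checked with:

$ kubectl describe pod -n kube-system calico-node-4bppq
$ kubectl get events -n kube-system --sort-by=.lastTimestamp | tail -n 20

If it is an image pull problem, the events usually show "Failed to pull image" or "ErrImagePull"; pulling the image manually on the affected compute with "docker pull <image>" can confirm whether the registry/proxy is reachable.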

Revision history for this message
Cristopher Lemus (cjlemusc) wrote :
Revision history for this message
Fernando Hernandez Gonzalez (fhernan2) wrote :

Same behavior on a 2+2+2 configuration.

--------------------------------------------------------------------------------------------
controller-0:~$ kubectl get pods --all-namespaces -o wide |grep -v -e "Running" -e "Completed"
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE
kube-system calico-node-jd2gd 0/2 ContainerCreating 0 3h51m 192.168.204.172 compute-0 <none>
kube-system calico-node-nlt9h 0/2 ContainerCreating 0 3h50m 192.168.204.133 compute-1 <none>
kube-system kube-proxy-7gkrj 0/1 ContainerCreating 0 3h50m 192.168.204.133 compute-1 <none>
kube-system kube-proxy-fpk2z 0/1 ContainerCreating 0 3h51m 192.168.204.172 compute-0 <none>
openstack nova-cell-setup-krck7 0/1 Init:0/2 0 158m 172.16.0.34 controller-0 <none>
openstack osh-openstack-garbd-garbd-5d495764d9-zrddq 0/1 Pending 0 3h22m <none> <none> <none>
--------------------------------------------------------------------------------------------
controller-0:~$ kubectl describe pods -n openstack osh-openstack-garbd-garbd-5d495764d9-zrddq
Name: osh-openstack-garbd-garbd-5d495764d9-zrddq
Namespace: openstack
Priority: 0
PriorityClassName: <none>
Node: <none>
Labels: application=garbd
                    component=server
                    pod-template-hash=5d495764d9
                    release_group=osh-openstack-garbd
Annotations: configmap-bin-hash: e9eadebcdb0d47224b47dfa165edeca8ac773b0939b1672ee5d4a438c2341f5a
Status: Pending
IP:
Controlled By: ReplicaSet/osh-openstack-garbd-garbd-5d495764d9
Init Containers:
  init:
    Image: 192.168.204.2:9001/quay.io/stackanetes/kubernetes-entrypoint:v0.3.1
    Port: <none>
    Host Port: <none>
    Command:
      kubernetes-entrypoint
    Environment:
      POD_NAME: osh-openstack-garbd-garbd-5d495764d9-zrddq (v1:metadata.name)
      NAMESPACE: openstack (v1:metadata.namespace)
      INTERFACE_NAME: eth0
      PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/
      DEPENDENCY_SERVICE:
      DEPENDENCY_DAEMONSET:
      DEPENDENCY_CONTAINER:
      DEPENDENCY_POD_JSON:
      COMMAND: echo done
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from osh-openstack-garbd-garbd-token-8x7p6 (ro)
Containers:
  garbd:
    Image: 192.168.204.2:9001/starlingx/stx-mariadb:dev-centos-pike-latest
    Port: <none>
    Host Port: <none>
    Command:
      /tmp/garbd.sh
    Environment:
      GROUP_NAME: mariadb-server_openstack
      GROUP_ADDRESS: gcomm://mariadb-server-0.mariadb-discovery.openst...


Cindy Xie (xxie1)
Changed in starlingx:
assignee: nobody → Austin Sun (sunausti)
Mingyuan Qi (myqi)
Changed in starlingx:
assignee: Austin Sun (sunausti) → Mingyuan Qi (myqi)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-config (master)

Fix proposed to branch: master
Review: https://review.openstack.org/639619

Changed in starlingx:
status: New → In Progress
Frank Miller (sensfan22)
Changed in starlingx:
importance: Undecided → High
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as release gating; high priority as it affects configurations where a proxy is used

summary: - [Containers] Pods not running on computes (Standar 2+2)
+ [Containers] Pods not running on computes (Standard 2+2)
tags: added: stx.2019.05
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-config (master)

Reviewed: https://review.openstack.org/639619
Committed: https://git.openstack.org/cgit/openstack/stx-config/commit/?id=1004ae82fd0290e5040542f05f851fc204956973
Submitter: Zuul
Branch: master

commit 1004ae82fd0290e5040542f05f851fc204956973
Author: Mingyuan Qi <email address hidden>
Date: Wed Feb 27 17:45:02 2019 +0800

    Fix k8s firewall blocks docker proxy port

    Allow docker proxy port(s) in k8s firewall. It unblocks worker
    node to access proxy via controller. As a result, worker node can
    successfully pull k8s/calico images through proxy.

    Duplex: docker proxy port correctly added in iptables SNAT rule
    2+2: docker proxy port correctly added in iptables SNAT rule
    2+2 without proxy: pass, no regression issue

    Closes-Bug: #1817723
    Change-Id: I7a8093a1fdce0089e5d0a9483a5c58184d1e213e
    Signed-off-by: Mingyuan Qi <email address hidden>
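
(Not part of the review; a hedged way to verify the fix on a proxied setup, assuming the docker proxy listens on port 8080, is to check that the proxy port now appears in the controller's NAT rules and that a worker can pull an image through it:)

controller-0:~$ sudo iptables -t nat -S | grep 8080
compute-0:~$ sudo docker pull k8s.gcr.io/pause:3.1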

Changed in starlingx:
status: In Progress → Fix Released
Ken Young (kenyis)
tags: added: stx.2.0
removed: stx.2019.05
Revision history for this message
Fernando Hernandez Gonzalez (fhernan2) wrote :

I believe we are having problems with pods not running, like LP https://bugs.launchpad.net/starlingx/+bug/1817723, but this time on a 2+2+2 cluster.

[wrsroot@controller-0 ~(keystone_admin)]$ kubectl get pods --all-namespaces | grep 0/1
kube-system calico-node-5kw4w 0/1 Init:0/2 0 6h4m
kube-system calico-node-whrbc 0/1 Init:0/2 0 6h4m
kube-system ceph-pools-audit-1558479000-zcmmr 0/1 Completed 0 11m
kube-system ceph-pools-audit-1558479300-dxj65 0/1 Completed 0 6m10s
kube-system ceph-pools-audit-1558479600-cm9rs 0/1 Completed 0 69s
kube-system kube-multus-ds-amd64-rglr7 0/1 ContainerCreating 0 6h4m
kube-system kube-multus-ds-amd64-wsqvv 0/1 ContainerCreating 0 6h4m
kube-system kube-proxy-kh8rs 0/1 ContainerCreating 0 6h4m
kube-system kube-proxy-pcxgc 0/1 ContainerCreating 0 6h4m
openstack osh-openstack-garbd-garbd-85564795f6-9sj6v 0/1 Pending 0 5h56m

[wrsroot@controller-0 ~(keystone_admin)]$ kubectl get pods -n openstack
NAME READY STATUS RESTARTS AGE
ingress-7bf7c8458f-dtr7t 1/1 Running 0 5h59m
ingress-7bf7c8458f-grw8h 1/1 Running 0 5h59m
ingress-error-pages-cf8cf7ccd-lw8sr 1/1 Running 0 5h59m
ingress-error-pages-cf8cf7ccd-s6hfg 1/1 Running 0 5h59m
mariadb-ingress-66c7f9964b-d4t9p 1/1 Running 0 5h59m
mariadb-ingress-66c7f9964b-jnhws 1/1 Running 0 5h59m
mariadb-ingress-error-pages-749cf64f44-dcr5h 1/1 Running 0 5h59m
mariadb-server-0 1/1 Running 0 5h59m
mariadb-server-1 1/1 Running 0 5h59m
osh-openstack-garbd-garbd-85564795f6-9sj6v 0/1 Pending 0 5h57m

[wrsroot@controller-0 ~(keystone_admin)]$ kubectl describe pod osh-openstack-garbd-garbd-85564795f6-9sj6v -n openstack
Name: osh-openstack-garbd-garbd-85564795f6-9sj6v
Namespace: openstack
Priority: 0
PriorityClassName: <none>
Node: <none>
Labels: application=garbd
                    component=server
                    pod-template-hash=85564795f6
                    release_group=osh-openstack-garbd
Annotations: configmap-bin-hash: e9eadebcdb0d47224b47dfa165edeca8ac773b0939b1672ee5d4a438c2341f5a
Status: Pending
IP:
Controlled By: ReplicaSet/osh-openstack-garbd-garbd-85564795f6
Init Containers:
  init:
    Image: 10.10.58.2:9001/quay.io/stackanetes/kubernetes-entrypoint:v0.3.1
    Port: <none>
    Host Port: <none>
    Command:
      kubernetes-entrypoint
    Environment:
      POD_NAME: osh-openstack-garbd-garbd-85564...


Revision history for this message
Mingyuan Qi (myqi) wrote :

@Fernando I don't think it's the same issue as the previous one; the log doesn't show that the containers are failing to run because of an image pull issue. I think you could file another LP and add more detailed logs, such as "kubectl describe node" output, and describe all the failed pods in the kube-system namespace.
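
(An illustrative sketch of the suggested commands; the node names are the ones used earlier in the report, and the loop simply describes every kube-system pod that is not Running or Completed:)

$ kubectl describe node compute-0 > compute-0-describe.txt
$ kubectl describe node compute-1 > compute-1-describe.txt
$ for p in $(kubectl get pods -n kube-system --no-headers | awk '$3!="Running" && $3!="Completed" {print $1}'); do
>   kubectl describe pod -n kube-system "$p"
> done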
