Containers: worker nodes get stuck at ContainerCreating for some time due to no network connectivity to floating IP

Bug #1817763 reported by Yang Liu
Affects: StarlingX
Status: Invalid
Importance: Medium
Assigned to: Joseph Richard

Bug Description

Brief Description
-----------------
After installing and unlocking the worker nodes, they were stuck at ContainerCreating for an extended amount of time (> 40m) because they failed to pull images from the external repo

Severity
--------
Minor

Steps to Reproduce
------------------
- Install and configure controller-0
- Install controller-1 and worker nodes from controller-0, and unlock them

Expected Behavior
------------------
- Worker nodes should pull images from the internal registry and reach the Ready state shortly after the unlock completes

Actual Behavior
----------------
- Worker nodes were trying to pull images from the external repo and got stuck at NotReady - ContainerCreating for more than 40 minutes

Reproducibility
---------------
Intermittent

System Configuration
--------------------
Multi-node system

Branch/Pull Time/Commit
-----------------------
f/stein as of 2019-02-25

Timestamp/Logs
--------------
NAME STATUS ROLES AGE VERSION
compute-0 NotReady <none> 45m v1.12.3
compute-1 NotReady <none> 44m v1.12.3
controller-0 Ready master 117m v1.12.3
controller-1 Ready master 65m v1.12.3

[wrsroot@controller-0 ~(keystone_admin)]$ kubectl get pods --all-namespaces -o wide | grep -v -e Completed -e Running
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE
kube-system calico-node-m6znx 0/2 ContainerCreating 0 41m 192.168.204.91 compute-0 <none>
kube-system calico-node-w9nlk 0/2 ContainerCreating 0 40m 192.168.204.185 compute-1 <none>
kube-system kube-proxy-66j88 0/1 ContainerCreating 0 40m 192.168.204.185 compute-1 <none>
kube-system kube-proxy-86jn4 0/1 ContainerCreating 0 41m 192.168.204.91 compute-0 <none>

Events:
  Type Reason Age From Message
  ---- ------ ---- ---- -------
  Normal Scheduled 43m default-scheduler Successfully assigned kube-system/calico-node-m6znx to compute-0
  Warning FailedCreatePodSandBox 2m30s (x87 over 42m) kubelet, compute-0 Failed create pod sandbox: rpc error: code = Unknown desc = failed pulling image "k8s.gcr.io/pause:3.1": Error response from daemon: Get https://k8s.gcr.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
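
The event above shows the kubelet on compute-0 going directly to k8s.gcr.io for the pause image rather than the internal registry. A minimal set of checks from an affected worker to confirm the failure mode (registry.local and the daemon.json path are assumptions about a typical StarlingX setup, not values taken from this report):

ping -c 3 registry.local                                  # is the internal registry (expected to sit behind the floating IP) reachable?
grep -i -e registry -e insecure /etc/docker/daemon.json   # which registries docker is actually configured with (path assumed)
docker pull k8s.gcr.io/pause:3.1                          # reproduces the external pull that was timing out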

Tags: stx.2.0
Ghada Khalil (gkhalil)
tags: added: stx.containers
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as release gating; medium priority as the timeout is intermittent. This should be addressed to make the container download more robust and less dependent on the external registry.

Changed in starlingx:
assignee: nobody → Angie Wang (angiewang)
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.2019.05
Revision history for this message
Frank Miller (sensfan22) wrote :

Update on this LP: The actual issue is network connectivity. The title of the LP has been updated.

Seen on 2.27 by Yang:
I'm seeing the computes issue again in another lab.

Also, when this happens:
- the floating IP for this system is not reachable, while the unit IPs are okay
- once the floating IP becomes reachable, the containers are created on the computes (see the connectivity check sketched below)
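
A minimal sketch of how to confirm that split from an affected compute; the addresses below are illustrative of a typical management subnet, not taken from this report's logs:

ping -c 3 192.168.204.2          # management floating IP: fails while the issue is present
ping -c 3 192.168.204.3          # controller unit IP: still reachable
ip neigh show 192.168.204.2      # is the ARP entry for the floating IP stale or incomplete?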

summary: Containers: worker nodes get stuck at ContainerCreating for some time
- due to failed to pull images from external repo
+ due to no network connectivity to floating IP
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: Angie Wang (angiewang) → Joseph Richard (josephrichard)
Ken Young (kenyis)
tags: added: stx.2.0
removed: stx.2019.05
Ghada Khalil (gkhalil)
tags: added: stx.retestneeded
Revision history for this message
Numan Waheed (nwaheed) wrote :

This issue is seen consistently in the IP 1-4 lab. The install of IP 1-4 fails because the floating IP address is not reachable, although the user can ping it and nslookup shows the IP address assigned to the right node.

Additionally, in this lab, even though the ping succeeded, it was redirected a couple of times:

yliu12@yow-cgts1-lx$ping 128.224.151.212
PING 128.224.151.212 (128.224.151.212) 56(84) bytes of data.
64 bytes from 128.224.151.212: icmp_seq=1 ttl=63 time=0.512 ms
From 128.224.144.1: icmp_seq=2 Redirect Host(New nexthop: 128.224.144.75)
From 128.224.144.1 icmp_seq=2 Redirect Host(New nexthop: 128.224.144.75)

# Normally, when the floating IP works correctly, we see something like this:

yliu12@yow-cgts1-lx$ping 128.224.151.212
PING 128.224.151.212 (128.224.151.212) 56(84) bytes of data.
64 bytes from 128.224.151.212: icmp_seq=1 ttl=63 time=11.2 ms
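
When redirects like this appear, a quick way to see where traffic to the floating IP is actually going (standard Linux tools; the address is the lab floating IP from the output above):

ip route get 128.224.151.212      # the nexthop and interface the kernel would use
traceroute -n 128.224.151.212     # whether traffic detours through 128.224.144.75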

Revision history for this message
Joseph Richard (josephrichard) wrote :

That issue in IP 1-4 is a lab configuration issue, and not related to the original issue in this bug report.

Revision history for this message
Joseph Richard (josephrichard) wrote :

When was the last time this behaviour was observed?

Frank Miller (sensfan22)
tags: removed: stx.containers
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Based on a review with Matt Peters (networking TL) and Yang Liu (reporter), it was concluded that this is an intermittent lab issue in the WR labs which results in loss of connectivity to the floating IP. As a workaround, the lab installer sends a gratuitous ARP (GARP) after the initial controller install.

Closing as no software fix is required for this.
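
For reference, a gratuitous ARP for the floating IP can be sent with iputils arping; the interface and address below are placeholders, not values from this report:

sudo arping -U -c 3 -I <mgmt-interface> <floating-ip>    # -U sends unsolicited (gratuitous) ARP so neighbours refresh their caches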

Changed in starlingx:
status: Triaged → Invalid
Revision history for this message
Yang Liu (yliu12) wrote :

The lab installer was updated to send a GARP when the floating IP issue is encountered.

tags: removed: stx.retestneeded