kube-dns pod not operationally present in agent

Bug #1694317 reported by Vedamurthy Joshi
This bug affects 1 person
Affects: Juniper Openstack (status tracked in Trunk)
  R4.0:  Status: Fix Committed, Importance: High, Assigned to: Praveen
  Trunk: Status: Fix Committed, Importance: High, Assigned to: Praveen

Bug Description

R4.0 Build 14 Ubuntu 16.04.2 Container setup

While working with the k8s cluster, it so happened that the agent did not have the VMI of the kube-dns VM object. The agent ifmap DB had the VM details, but operationally it was not up.

From the CNI logs, it looks like the VMI for the pod was deleted altogether.

root@nodec1:~# kubectl describe pod kube-dns --namespace=kube-system
Name: kube-dns-3121064917-wk1rd
Namespace: kube-system
Node: nodek1/10.204.216.221
Start Time: Mon, 29 May 2017 15:03:43 +0530
Labels: k8s-app=kube-dns
  pod-template-hash=3121064917
Annotations: kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicaSet","namespace":"kube-system","name":"kube-dns-3121064917","uid":"6e9f79da-442c-11e7-96b4-002590c3...
  scheduler.alpha.kubernetes.io/critical-pod=
Status: Running
IP:
Controllers: ReplicaSet/kube-dns-3121064917
Containers:
  kubedns:
    Container ID: docker://8c7fff159158436b4c756e599458556c27e0be3b02f1992a0530c509dae8e52a
    Image: gcr.io/google_containers/k8s-dns-kube-dns-amd64:1.14.1
    Image ID: docker-pullable://gcr.io/google_containers/k8s-dns-kube-dns-amd64@sha256:33914315e600dfb756e550828307dfa2b21fb6db24fe3fe495e33d1022f9245d
    Ports: 10053/UDP, 10053/TCP, 10055/TCP
    Args:
      --domain=cluster.local.
      --dns-port=10053
      --config-dir=/kube-dns-config
      --v=2
    State: Running
      Started: Mon, 29 May 2017 15:40:27 +0530
    Ready: True
    Restart Count: 0
    Limits:
      memory: 170Mi
    Requests:
      cpu: 100m
      memory: 70Mi
    Liveness: exec [ping -c 1 127.0.0.1] delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness: exec [ping -c 1 127.0.0.1] delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      PROMETHEUS_PORT: 10055
    Mounts:
      /kube-dns-config from kube-dns-config (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-dns-token-4p5rc (ro)
  dnsmasq:
    Container ID: docker://00db2adb01c22c7299d2f1ffa675c07604ce267c9106af31977c995475cacd3c
    Image: gcr.io/google_containers/k8s-dns-dnsmasq-nanny-amd64:1.14.1
    Image ID: docker-pullable://gcr.io/google_containers/k8s-dns-dnsmasq-nanny-amd64@sha256:89c9a1d3cfbf370a9c1a949f39f92c1dc2dbe8c3e6cc1802b7f2b48e4dfe9a9e
    Ports: 53/UDP, 53/TCP
    Args:
      -v=2
      -logtostderr
      -configDir=/etc/k8s/dns/dnsmasq-nanny
      -restartDnsmasq=true
      --
      -k
      --cache-size=1000
      --log-facility=-
      --server=/cluster.local/127.0.0.1#10053
      --server=/in-addr.arpa/127.0.0.1#10053
      --server=/ip6.arpa/127.0.0.1#10053
    State: Running
      Started: Mon, 29 May 2017 15:40:36 +0530
    Ready: True
    Restart Count: 0
    Requests:
      cpu: 150m
      memory: 20Mi
    Liveness: exec [ping -c 1 127.0.0.1] delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness: exec [ping -c 1 127.0.0.1] delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment: <none>
    Mounts:
      /etc/k8s/dns/dnsmasq-nanny from kube-dns-config (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-dns-token-4p5rc (ro)
  sidecar:
    Container ID: docker://5d085569216c02784bba10412f25e77d6153b180ea252512d4aa248705adc816
    Image: gcr.io/google_containers/k8s-dns-sidecar-amd64:1.14.1
    Image ID: docker-pullable://gcr.io/google_containers/k8s-dns-sidecar-amd64@sha256:d33a91a5d65c223f410891001cd379ac734d036429e033865d700a4176e944b0
    Port: 10054/TCP
    Args:
      --v=2
      --logtostderr
      --probe=kubedns,127.0.0.1:10053,kubernetes.default.svc.cluster.local,5,A
      --probe=dnsmasq,127.0.0.1:53,kubernetes.default.svc.cluster.local,5,A
    State: Running
      Started: Mon, 29 May 2017 15:40:44 +0530
    Ready: True
    Restart Count: 0
    Requests:
      cpu: 10m
      memory: 20Mi
    Liveness: exec [ls] delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness: exec [ls] delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment: <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-dns-token-4p5rc (ro)
Conditions:
  Type Status
  Initialized True
  Ready True
  PodScheduled True
Volumes:
  kube-dns-config:
    Type: ConfigMap (a volume populated by a ConfigMap)
    Name: kube-dns
    Optional: true
  kube-dns-token-4p5rc:
    Type: Secret (a volume populated by a Secret)
    SecretName: kube-dns-token-4p5rc
    Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.alpha.kubernetes.io/notReady=:Exists:NoExecute for 300s
  node.alpha.kubernetes.io/unreachable=:Exists:NoExecute for 300s
Events:
  FirstSeen LastSeen Count From SubObjectPath Type Reason Message
  --------- -------- ----- ---- ------------- -------- ------ -------
  1h 46m 27 kubelet, nodek1 Warning FailedSync Error syncing pod, skipping: failed to "CreatePodSandbox" for "kube-dns-3121064917-wk1rd_kube-system(6ebf84d6-442c-11e7-96b4-002590c30af2)" with CreatePodSandboxError: "CreatePodSandbox for pod \"kube-dns-3121064917-wk1rd_kube-system(6ebf84d6-442c-11e7-96b4-002590c30af2)\" failed: rpc error: code = 2 desc = NetworkPlugin cni failed to set up pod \"kube-dns-3121064917-wk1rd_kube-system\" network: Failed in PollVM. Error : Failed HTTP Get operation. Return code 404"

  1h 46m 27 kubelet, nodek1 Normal SandboxChanged Pod sandbox changed, it will be killed and re-created.
  46m 46m 1 kubelet, nodek1 spec.containers{kubedns} Normal Pulling pulling image "gcr.io/google_containers/k8s-dns-kube-dns-amd64:1.14.1"
  46m 46m 1 kubelet, nodek1 spec.containers{kubedns} Normal Pulled Successfully pulled image "gcr.io/google_containers/k8s-dns-kube-dns-amd64:1.14.1"
  46m 46m 1 kubelet, nodek1 spec.containers{kubedns} Normal Created Created container with id 8c7fff159158436b4c756e599458556c27e0be3b02f1992a0530c509dae8e52a
  46m 46m 1 kubelet, nodek1 spec.containers{kubedns} Normal Started Started container with id 8c7fff159158436b4c756e599458556c27e0be3b02f1992a0530c509dae8e52a
  46m 46m 1 kubelet, nodek1 spec.containers{dnsmasq} Normal Pulling pulling image "gcr.io/google_containers/k8s-dns-dnsmasq-nanny-amd64:1.14.1"
  45m 45m 1 kubelet, nodek1 spec.containers{dnsmasq} Normal Pulled Successfully pulled image "gcr.io/google_containers/k8s-dns-dnsmasq-nanny-amd64:1.14.1"
  45m 45m 1 kubelet, nodek1 spec.containers{dnsmasq} Normal Created Created container with id 00db2adb01c22c7299d2f1ffa675c07604ce267c9106af31977c995475cacd3c
  45m 45m 1 kubelet, nodek1 spec.containers{dnsmasq} Normal Started Started container with id 00db2adb01c22c7299d2f1ffa675c07604ce267c9106af31977c995475cacd3c
  45m 45m 1 kubelet, nodek1 spec.containers{sidecar} Normal Pulling pulling image "gcr.io/google_containers/k8s-dns-sidecar-amd64:1.14.1"
  45m 45m 1 kubelet, nodek1 spec.containers{sidecar} Normal Pulled Successfully pulled image "gcr.io/google_containers/k8s-dns-sidecar-amd64:1.14.1"
  45m 45m 1 kubelet, nodek1 spec.containers{sidecar} Normal Created Created container with id 5d085569216c02784bba10412f25e77d6153b180ea252512d4aa248705adc816
  45m 45m 1 kubelet, nodek1 spec.containers{sidecar} Normal Started Started container with id 5d085569216c02784bba10412f25e77d6153b180ea252512d4aa248705adc816

root@nodec1:~#

cni log:

I : 7706 : 2017/05/29 15:05:02 cni.go:90: &{cniArgs:0xc4202cf340 Mode:k8s VifType:veth VifParent:eth0 LogDir:/var/log/contrail/cni LogFile:/var/log/contrail/cni/opencontrail.log LogLevel:4 ContainerUuid:6dc66b31-442c-11e7-96b4-002590c30af2 ContainerName:kube-dns-38413634-wq0g2 ContainerVn: VRouter:{Server:127.0.0.1 Port:9091 Dir:/var/lib/contrail/ports/vm PollTimeout:5 PollRetries:15 containerUuid: containerVn: httpClient:0xc4201c96b0}}
I : 7706 : 2017/05/29 15:05:02 vrouter.go:384: {Server:127.0.0.1 Port:9091 Dir:/var/lib/contrail/ports/vm PollTimeout:5 PollRetries:15 containerUuid: containerVn: httpClient:0xc4201c96b0}
I : 7706 : 2017/05/29 15:05:02 veth.go:161: {CniIntf:{containerUuid:6dc66b31-442c-11e7-96b4-002590c30af2 containerIfName:eth0 containerNamespace: mtu:1500} HostIfName:tap6dc66b31-44 TmpHostIfName:tmp6dc66b31-44}
I : 7706 : 2017/05/29 15:05:02 veth.go:33: Deleting VEth interface {CniIntf:{containerUuid:6dc66b31-442c-11e7-96b4-002590c30af2 containerIfName:eth0 containerNamespace: mtu:1500} HostIfName:tap6dc66b31-44 TmpHostIfName:tmp6dc66b31-44}
I : 7706 : 2017/05/29 15:05:02 interface.go:49: Deleting interface tap6dc66b31-44
I : 7706 : 2017/05/29 15:05:02 interface.go:53: Interface tap6dc66b31-44 not present. Error Link not found
E : 7706 : 2017/05/29 15:05:02 veth.go:41: Deleted interface
I : 7706 : 2017/05/29 15:05:02 cni.go:206: Deleted interface eth0 inside container
I : 7706 : 2017/05/29 15:05:02 vrouter.go:339: Deleting container : 6dc66b31-442c-11e7-96b4-002590c30af2 Vn :
I : 7706 : 2017/05/29 15:05:02 vrouter.go:299: File /var/lib/contrail/ports/vm/6dc66b31-442c-11e7-96b4-002590c30af2 not found. Error : stat /var/lib/contrail/ports/vm/6dc66b31-442c-11e7-96b4-002590c30af2: no such file or directory
I : 7706 : 2017/05/29 15:05:02 vrouter.go:78: VRouter request. Operation : DELETE Url : http://127.0.0.1:9091/vm/6dc66b31-442c-11e7-96b4-002590c30af2
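
For context, the CNI plugin waits for the vrouter agent (Server:127.0.0.1, Port:9091 in the config above) to report the pod's port information, retrying up to PollRetries times with PollTimeout seconds between attempts; the "Failed in PollVM ... Return code 404" in the kubelet event above means that poll never succeeded. A minimal sketch of such a poll loop, assuming a GET on /vm/<uuid> analogous to the DELETE URL in the last log line (the actual PollVM code in vrouter.go may differ):

package cni

import (
    "fmt"
    "net/http"
    "time"
)

// pollVM keeps asking the vrouter agent for the port info of the given pod
// UUID until it appears or the retries are exhausted. The endpoint path and
// status handling here are assumptions based on the log lines above.
func pollVM(server string, port int, uuid string, retries int, timeout time.Duration) error {
    url := fmt.Sprintf("http://%s:%d/vm/%s", server, port, uuid)
    client := &http.Client{Timeout: timeout}
    for i := 0; i < retries; i++ {
        resp, err := client.Get(url)
        if err == nil {
            resp.Body.Close()
            if resp.StatusCode == http.StatusOK {
                return nil // agent has the VMI; CNI can continue setting up the pod
            }
        }
        time.Sleep(timeout) // wait PollTimeout before the next attempt
    }
    return fmt.Errorf("Failed in PollVM for %s: agent never reported the VMI", uuid)
}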

Jeba Paulaiyan (jebap)
tags: added: blocker
Revision history for this message
Praveen (praveen-karadakal) wrote :

CNI creates the tap interface based on the UUID of the "pause" container. This creates a problem in the following sequence of container create/destroy operations for the same dns-pod (UUID 6ebf84d6-442c-11e7-96b4-002590c30af2):

I : 14981 : 2017/05/29 15:38:47 contrail-kube-cni.go:52: Came in Add for container 1e7accb7c7fe962790737fe437f82b6ea8b220633022855a97a9710e05463a6b

>>> This Add operation fails since the agent does not have the configuration yet. It creates the tap interface tap6ebf84d6-44.

I : 15267 : 2017/05/29 15:40:05 contrail-kube-cni.go:52: Came in Add for container 1107ef3bd6b8a05ec3aff23d07da087bf4c8fb02727b2d0ad7ac5f190330dc36

>>> This Add operation succeeds since the agent has got the configuration by this time. This container also uses the tap name tap6ebf84d6-44.

I : 16128 : 2017/05/29 15:41:02 contrail-kube-cni.go:82: Came in Del for container 1e7accb7c7fe962790737fe437f82b6ea8b220633022855a97a9710e05463a6b

>>> This Del operation removes the tap interface tap6ebf84d6-44. This results in the deletion of the tap interface used by container-id 1107ef3bd6b8a05ec3aff23d07da087bf4c8fb02727b2d0ad7ac5f190330dc36.

Solution:
Create the tap interface based on container-id instead of UUID. The tap interface name format will be:
tap<6-upper-digits-of-container-id><6-lower-digits-of-container-id>
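
For illustration, a sketch of such a naming helper (the slicing below follows the "<6-upper><6-lower>" description above; the exact scheme in the committed fix may differ, and buildTapName is a hypothetical name):

package cni

// buildTapName derives the tap interface name from the docker container-id
// instead of the pod UUID, so a retried container (new container-id, same
// pod UUID) gets a distinct tap name. With 3 + 6 + 6 characters the result
// also stays within the 15-character Linux interface-name limit.
func buildTapName(containerID string) string {
    if len(containerID) <= 12 {
        return "tap" + containerID // short ids: use them whole
    }
    return "tap" + containerID[:6] + containerID[len(containerID)-6:]
}

With the container-ids from the log above, the first pause container (1e7accb7...) and its retry (1107ef3b...) would then get different tap interfaces, so the late Del of the first one no longer removes the interface that the second one is using.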

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R4.0

Review in progress for https://review.opencontrail.org/32260
Submitter: Praveen K V (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/32260
Committed: http://github.com/Juniper/contrail-controller/commit/1de1c4fcb49cdb636912d48a17cba5f793c2c495
Submitter: Zuul (<email address hidden>)
Branch: R4.0

commit 1de1c4fcb49cdb636912d48a17cba5f793c2c495
Author: Praveen K V <email address hidden>
Date: Tue May 30 14:00:00 2017 +0530

Build tap interface name based on container-id

Problem:
In some failure scenarios kubelet can retry spawning a container. The
containers spawned will have a different container-id but the same UUID. It
can so happen that create and delete of such containers overlap.

CNI used to generate the tap interface name based on the UUID. As a result, the
tap interface name generated on a retry of the container remains unchanged. When
add/delete of containers overlap, it is possible that the deletion of one
container removes a tap interface used by another container.

Fix:
Create tap interfaces based on the container-id instead of the container-uuid.
The container-id changes on every retry of a container.

Change-Id: I9e51acd97b7a08560e780b138d228ff638546b2a
Closes-Bug: #1694317

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/32291
Submitter: Praveen K V (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/32291
Committed: http://github.com/Juniper/contrail-controller/commit/d58aaa646830d0d76af2f841f0d562dfaa5148a9
Submitter: Zuul (<email address hidden>)
Branch: master

commit d58aaa646830d0d76af2f841f0d562dfaa5148a9
Author: Praveen K V <email address hidden>
Date: Tue May 30 14:00:00 2017 +0530

Build tap interface name based on container-id

Problem:
In some failure scenarios kubelet can retry spawning a container. The
containers spawned will have a different container-id but the same UUID. It
can so happen that create and delete of such containers overlap.

CNI used to generate the tap interface name based on the UUID. As a result, the
tap interface name generated on a retry of the container remains unchanged. When
add/delete of containers overlap, it is possible that the deletion of one
container removes a tap interface used by another container.

Fix:
Create tap interfaces based on the container-id instead of the container-uuid.
The container-id changes on every retry of a container.

Change-Id: I9e51acd97b7a08560e780b138d228ff638546b2a
Closes-Bug: #1694317
(cherry picked from commit 1de1c4fcb49cdb636912d48a17cba5f793c2c495)

Revision history for this message
Vedamurthy Joshi (vedujoshi) wrote :

The issue was seen again later.
Praveen has a fix for the same issue.

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R4.0

Review in progress for https://review.opencontrail.org/32399
Submitter: Praveen K V (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/32399
Committed: http://github.com/Juniper/contrail-controller/commit/935075945b0ba0fc7a0941beb59c083efc877bbc
Submitter: Zuul (<email address hidden>)
Branch: R4.0

commit 935075945b0ba0fc7a0941beb59c083efc877bbc
Author: Praveen K V <email address hidden>
Date: Fri Jun 2 15:35:01 2017 +0530

Fix for ip-address not assigned to container on respawn

If a container fails (due to a CNI failure or otherwise), kubelet will
delete the failed container and create a new one. In some cases kubelet
calls CNI multiple times for the failed container. We create interfaces
with names based on the POD-UUID, so both the new and the old container
map to the same tap interface name. If the delete of the old container
comes after the new container is spawned, the delete must be ignored.

When the new container is created, the config file stored by vrouter is
updated with the new container-id. Compare the container-id in this
instance with the one present in the config file and ignore the request
if they do not match: no message is sent to the agent and the config
file is not deleted.

Change-Id: I43f85e49f5304075e8ce32b13295e7cded752172
Closes-Bug: #1694317
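
A sketch of the guard described in this commit, assuming the per-pod file under /var/lib/contrail/ports/vm/<uuid> is JSON carrying a container-id field (the field name and file layout here are assumptions for illustration; the real format is a detail of the CNI/vrouter code):

package cni

import (
    "encoding/json"
    "io/ioutil"
    "os"
    "path/filepath"
)

// shouldIgnoreDelete reports whether a CNI Del for containerID must be
// ignored because the config file for this pod UUID has already been
// rewritten by a newer container that reuses the same UUID.
func shouldIgnoreDelete(portDir, podUUID, containerID string) (bool, error) {
    data, err := ioutil.ReadFile(filepath.Join(portDir, podUUID))
    if err != nil {
        if os.IsNotExist(err) {
            return false, nil // no config file left; nothing newer to protect
        }
        return false, err
    }
    var cfg struct {
        ContainerID string `json:"container-id"` // assumed field name
    }
    if err := json.Unmarshal(data, &cfg); err != nil {
        return false, err
    }
    // A mismatch means the file now belongs to a respawned container:
    // skip the delete so its tap interface and config stay intact.
    return cfg.ContainerID != containerID, nil
}

When the check returns true, the plugin would return success to kubelet without sending the DELETE to the agent or removing the config file, matching the behaviour the commit message describes.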

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/32411
Submitter: Praveen K V (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/32411
Committed: http://github.com/Juniper/contrail-controller/commit/6e630f03a982cdac9b5bfe908a2da02a26951a8c
Submitter: Zuul (<email address hidden>)
Branch: master

commit 6e630f03a982cdac9b5bfe908a2da02a26951a8c
Author: Praveen K V <email address hidden>
Date: Fri Jun 2 15:35:01 2017 +0530

Fix for ip-address not assigned to container on respawn

If a container fails (due to a CNI failure or otherwise), kubelet will
delete the failed container and create a new one. In some cases kubelet
calls CNI multiple times for the failed container. We create interfaces
with names based on the POD-UUID, so both the new and the old container
map to the same tap interface name. If the delete of the old container
comes after the new container is spawned, the delete must be ignored.

When the new container is created, the config file stored by vrouter is
updated with the new container-id. Compare the container-id in this
instance with the one present in the config file and ignore the request
if they do not match: no message is sent to the agent and the config
file is not deleted.

Change-Id: I43f85e49f5304075e8ce32b13295e7cded752172
Closes-Bug: #1694317
(cherry picked from commit 935075945b0ba0fc7a0941beb59c083efc877bbc)
