Multicloud :: Azure OnPrem :: k8s pods stuck in state ContainerCreating

Bug #1800207 reported by Ritam Gangopadhyay
This bug affects 1 person
Affects                Status       Importance   Assigned to          Milestone
Juniper Openstack      Status tracked in Trunk
  R5.0                 Incomplete   High         Ritam Gangopadhyay
  Trunk                Incomplete   High         Ritam Gangopadhyay

Bug Description

Setup:

contrail-controller and k8s_master - 10.87.74.129 - 192.168.1.1
contrail-compute and k8s_node - 10.87.74.130 - 192.168.1.2

root@5c3s1-node1:~# kubectl get pods -o wide
NAME                READY   STATUS              RESTARTS   AGE   IP         NODE
ubuntuapp-custom1   1/1     Running             0          2d    10.1.1.4   5c3s1-node2
ubuntuapp-custom2   0/1     Pending             0          2d    <none>     <none>
ubuntuapp-custom4   0/1     ContainerCreating   0          4h    <none>     rg-compute-1
ubuntuapp-local1    0/1     ContainerCreating   0          4h    <none>     5c3s1-node2
root@5c3s1-node1:~#

CNI logs
********************
********************
********************

I : 14779 : 2018/10/26 08:00:14 contrail-kube-cni.go:53: Came in Add for container a74e4674c95bf8d9b298d497f92ccbe627a230059329695c4c06f5bd3d83daf6
I : 14779 : 2018/10/26 08:00:14 contrail-kube-cni.go:41: getPodInfo success. container-id a74e4674c95bf8d9b298d497f92ccbe627a230059329695c4c06f5bd3d83daf6 uuid fce83917-d92c-11e8-9931-0cc47afb60d8 name ubuntuapp-local1
I : 14779 : 2018/10/26 08:00:14 cni.go:88: ContainerID : a74e4674c95bf8d9b298d497f92ccbe627a230059329695c4c06f5bd3d83daf6
I : 14779 : 2018/10/26 08:00:14 cni.go:89: NetNS : /proc/14713/ns/net
I : 14779 : 2018/10/26 08:00:14 cni.go:90: Container Ifname : eth0
I : 14779 : 2018/10/26 08:00:14 cni.go:91: Args : IgnoreUnknown=1;K8S_POD_NAMESPACE=default;K8S_POD_NAME=ubuntuapp-local1;K8S_POD_INFRA_CONTAINER_ID=a74e4674c95bf8d9b298d497f92ccbe627a230059329695c4c06f5bd3d83daf6
I : 14779 : 2018/10/26 08:00:14 cni.go:92: CNI VERSION : 0.2.0
I : 14779 : 2018/10/26 08:00:14 cni.go:93: MTU : 1500
I : 14779 : 2018/10/26 08:00:14 cni.go:94: Config File : {"cniVersion":"0.2.0","contrail":{"config-dir":"/var/lib/contrail/ports/vm","log-file":"/var/log/contrail/cni/opencontrail.log","log-level":"4","poll-retries":15,"poll-timeout":5,"vrouter-ip":"127.0.0.1","vrouter-port":9091},"name":"contrail-k8s-cni","type":"contrail-k8s-cni"}
I : 14779 : 2018/10/26 08:00:14 cni.go:95: &{cniArgs:0xc4202d7340 Mode:k8s VifType:veth VifParent:eth0 LogDir:/var/log/contrail/cni LogFile:/var/log/contrail/cni/opencontrail.log LogLevel:4 Mtu:1500 ContainerUuid:fce83917-d92c-11e8-9931-0cc47afb60d8 ContainerName:ubuntuapp-local1 ContainerVn: VRouter:{Server:127.0.0.1 Port:9091 Dir:/var/lib/contrail/ports/vm PollTimeout:5 PollRetries:15 containerId: containerUuid: containerVn: httpClient:0xc4201cf710}}
I : 14779 : 2018/10/26 08:00:14 vrouter.go:446: {Server:127.0.0.1 Port:9091 Dir:/var/lib/contrail/ports/vm PollTimeout:5 PollRetries:15 containerId: containerUuid: containerVn: httpClient:0xc4201cf710}
I : 14779 : 2018/10/26 08:00:14 vrouter.go:79: VRouter request. Operation : GET Url : http://127.0.0.1:9091/vm-cfg/fce83917-d92c-11e8-9931-0cc47afb60d8
E : 14779 : 2018/10/26 08:00:14 vrouter.go:147: Failed HTTP Get operation. Return code 404
I : 14779 : 2018/10/26 08:00:14 vrouter.go:181: Iteration 0 : Get vrouter failed
I : 14779 : 2018/10/26 08:00:19 vrouter.go:79: VRouter request. Operation : GET Url : http://127.0.0.1:9091/vm-cfg/fce83917-d92c-11e8-9931-0cc47afb60d8
E : 14779 : 2018/10/26 08:00:19 vrouter.go:147: Failed HTTP Get operation. Return code 404
I : 14779 : 2018/10/26 08:00:19 vrouter.go:181: Iteration 1 : Get vrouter failed

********************
********************
********************
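
For reference, the vm-cfg endpoint that the CNI keeps polling can be queried by hand. A minimal sketch, reusing the URL and config-dir from the log above (the pod UUID is the one from this run):

# Query the vrouter agent's port-ipc endpoint directly (the same URL the
# CNI polls above); a persistent 404 means the agent never received this
# pod's port configuration.
curl -v http://127.0.0.1:9091/vm-cfg/fce83917-d92c-11e8-9931-0cc47afb60d8

# The per-pod port files live in the config-dir from the CNI config above;
# a missing file for the pod UUID points the same way.
ls -l /var/lib/contrail/ports/vm/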

TRACEBACK SEEN IN KUBE-MANAGER.LOG

********************
********************
********************

The above is a description of an error in a Python program. Here is
the original traceback:

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/kube_manager/vnc/vnc_kubernetes.py", line 538, in vnc_process
    self.service_mgr.process(event)
  File "/usr/lib/python2.7/site-packages/kube_manager/vnc/vnc_service.py", line 686, in process
    specified_fip_pool_fq_name_str)
  File "/usr/lib/python2.7/site-packages/kube_manager/vnc/vnc_service.py", line 528, in vnc_service_add
    service_ip, ports)
  File "/usr/lib/python2.7/site-packages/kube_manager/vnc/vnc_service.py", line 263, in _lb_create
    service_namespace, service_ip)
  File "/usr/lib/python2.7/site-packages/kube_manager/vnc/vnc_service.py", line 255, in _vnc_create_lb
    vn_obj, service_ip, service_ipam_subnet_uuid)
  File "/usr/lib/python2.7/site-packages/kube_manager/vnc/loadbalancer.py", line 192, in create
    tags=tags)
  File "/usr/lib/python2.7/site-packages/kube_manager/vnc/loadbalancer.py", line 115, in _create_virtual_interface
    self._vnc_lib.instance_ip_update(iip_obj)
  File "/usr/lib/python2.7/site-packages/vnc_api/vnc_api.py", line 51, in wrapper
    return func(self, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vnc_api/vnc_api.py", line 732, in _object_update
    OP_PUT, uri, data=json_body)
  File "/usr/lib/python2.7/site-packages/vnc_api/vnc_api.py", line 1032, in _request_server
    retry_after_authn=retry_after_authn, retry_count=retry_count)
  File "/usr/lib/python2.7/site-packages/vnc_api/vnc_api.py", line 1085, in _request
    % (op, url, data, content))
NoIdError: Unknown id: Error: oper 3 url /instance-ip/f7bd3f52-a758-4056-86b2-4cde2fffa50f body {"instance-ip":{"fq_name": ["kube-dns__f7bd3f52-a758-4056-86b2-4cde2fffa50f"], "uuid": "f7bd3f52-a758-4056-86b2-4cde2fffa50f", "instance_ip_address": "10.96.0.10", "virtual_machine_interface_refs": [{"to": ["default-domain", "k8s-kube-system", "kube-dns__e3b2eeb7-168b-4cbb-8842-8017a6b952c9"], "uuid": "e3b2eeb7-168b-4cbb-8842-8017a6b952c9"}], "virtual_network_refs": [{"to": ["default-domain", "k8s-default", "k8s-default-service-network"], "uuid": "a27c3100-8e72-4849-882f-68148a87bfb4"}], "subnet_uuid": "dfeaaa78-c487-4479-9cf8-3e181d75d9bb", "display_name": "kube-dns"}} response Unknown id: f7bd3f52-a758-4056-86b2-4cde2fffa50f

10/26/2018 03:15:01 PM [contrail-kube-manager] [WARNING]: Add/Modify endpoints event received while service kubernetes does not exist
10/26/2018 03:15:01 PM [contrail-kube-manager] [WARNING]: __default__ [SYS_WARN]: KubeManagerWarningLog: Add/Modify endpoints event received while service kubernetes does not exist
10/26/2018 03:15:01 PM [contrail-kube-manager] [WARNING]: Add/Modify endpoints event received while service kube-dns does not exist
10/26/2018 03:15:01 PM [contrail-kube-manager] [WARNING]: __default__ [SYS_WARN]: KubeManagerWarningLog: Add/Modify endpoints event received while service kube-dns does not exist

********************
********************
********************

The UUID for the instance IP does not exist in the config API, but the VMI ref does:
root@5c3s1-node1:~# curl -s http://192.168.1.1:8082/virtual-machine-interfaces | mjson | grep 4cde2fffa50f
root@5c3s1-node1:~# curl -s http://192.168.1.1:8082/virtual-machine-interfaces | mjson | grep 8017a6b952c9
               "kube-dns__e3b2eeb7-168b-4cbb-8842-8017a6b952c9"
           "href": "http://192.168.1.1:8082/virtual-machine-interface/e3b2eeb7-168b-4cbb-8842-8017a6b952c9",
           "uuid": "e3b2eeb7-168b-4cbb-8842-8017a6b952c9"
root@5c3s1-node1:~#
"uuid": "e3b2eeb7-168b-4cbb-8842-8017a6b952c9",
       "virtual_machine_interface_device_owner": "K8S:LOADBALANCER",
       "virtual_machine_interface_mac_addresses": {
           "mac_address": [
               "02:e3:b2:ee:b7:16"
           ]
       },
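
The stale instance-ip can be confirmed the same way; a quick sketch following the curl pattern above (this should come back with an "Unknown id" error, matching the NoIdError in the traceback):

curl -s http://192.168.1.1:8082/instance-ip/f7bd3f52-a758-4056-86b2-4cde2fffa50f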

Jeba Paulaiyan (jebap)
tags: added: blocker
Ritam Gangopadhyay (ritam) wrote:

Not a blocker; a k8s cleanup with a drain and delete of all nodes followed by a re-join fixes the issue.
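
(For reference, a rough sketch of such a cleanup with kubeadm; the node name is from this setup, the join parameters are placeholders, and the exact drain flags depend on the kubectl version in use:)

# On the master: drain and remove the affected node.
kubectl drain 5c3s1-node2 --ignore-daemonsets --delete-local-data
kubectl delete node 5c3s1-node2

# On the drained node: reset, then re-join with the cluster's own join command.
kubeadm reset
kubeadm join <master-ip>:6443 --token <token> --discovery-token-ca-cert-hash <hash>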

tags: removed: blocker
Sanju Abraham (asanju) wrote:

This is more of a config API issue, as shown in the curl output above.

The UUID for the instance IP does not exist in the config API, but the VMI ref does.

Assigning to Yogi

tags: added: releasenote
Sathish Holla (sathishholla) wrote:

Hi Ritam,

Do you have the logs for this stored anywhere?
From the description, I understand that the Instance IP was deleted, but the VMI still had back-refs to the deleted Instance IP. Please let me know if my understanding is wrong.

Thanks,
Sathish

Ritam Gangopadhyay (ritam) wrote:

The logs that I collected are provided in the bug description. I don't have any other logs.

Sathish Holla (sathishholla) wrote:

Hi Ritam,

With just those prints, it's unlikely that we can debug the issue further.
The issue can happen in various scenarios: DB corruption (a ZK and Cassandra mismatch), wrong values sent from the client, or code issues in config.
To isolate it further, we will need config logs collected while the issue is being reproduced.

I'll move the LP back to your name, as we don't have enough logs to investigate further.
Please re-open the bug if you see this issue again.

Thanks,
Sathish

Ritam Gangopadhyay (ritam) wrote:

Hi Sathish,

   Can you please list out the log files to be collected if I hit the issue again?

   Holding the setup for such a long period would be a bit difficult. Please let me know which logs you would like to look at, and I will save them in a location for your scrutiny. Do revert the LP back to me with that info so I have something concrete to work on.

Regards,
Ritam.

Sathish Holla (sathishholla) wrote:

Hi Ritam,

I understand. When this issue occurs, please collect the following logs:

Controller Logs:
We will need config-api.log. Better still, collect all the logs under /var/log/contrail on all the controller nodes; that's easier.

Analytics Logs:
1. contrail-logs --object-type config --last 1h ===> the --last argument can be adjusted based on when the issue occurred.
2. contrail-logs --object-type config-user --last 1h ===> likewise, adjust --last based on when the issue occurred.
More info on analytics logs can be found at: https://github.com/Juniper/contrail-controller/wiki/Contrail-utility-scripts-for-getting-logs,-stats-and-flows-from-analytics

Config DB dump:
From any one of the config_api containers, collect the config DB dump as follows:
1. cd to /usr/lib/python2.7/dist-packages/cfgm_common (or /usr/lib/python2.7/site-packages/cfgm_common, depending on the underlying OS).
2. python db_json_exim.py --export-to db-dump.json

More info about this is at:
https://www.juniper.net/documentation/en_US/contrail5.0/topics/concept/backup-using-json-40.html
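
Putting the above together, a rough collection sketch; the archive name and the --last window are placeholders to adjust per setup:

# On each controller node: grab everything under /var/log/contrail.
tar czf contrail-logs-$(hostname).tgz /var/log/contrail

# Analytics: object logs around the failure window.
contrail-logs --object-type config --last 1h > config-objects.log
contrail-logs --object-type config-user --last 1h > config-user-objects.log

# Inside a config_api container: export the config DB dump.
cd /usr/lib/python2.7/site-packages/cfgm_common
python db_json_exim.py --export-to db-dump.json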

The logs I've requested here are not actually limited to this issue; in general, we will require these logs for any config-related issue.

-Sathish

Jeba Paulaiyan (jebap) wrote:

Notes:

While provisioning a k8s cluster, pods sometimes get stuck in the ContainerCreating state. A k8s cleanup with a drain and delete of all nodes followed by a re-join fixes the issue.
