k8s:pod creation fails after the config api restart

Bug #1793269 reported by Venkatesh Velpula
This bug affects 1 person
Affects: Juniper Openstack (status tracked in Trunk)
  R5.0:  Triaged / High / assigned to Sathish Holla
  Trunk: Triaged / High / assigned to Sathish Holla

Bug Description

After a restart of the config API, pod creation fails. We hit this issue intermittently, but when we do hit it, all the test cases fail after the config API restart.

Right now I have kept the setup intact in the problem state; could you please look at it?

Build: 5.1.0-250
Deployment: Ansible_deployer
Host OS: CentOS 7.5
=======================

Topology
==================
vrouter + k8s_node:

      ip: nodec60
      ip: nodec61

config + control + kubemanager:

      ip: nodeg12(k8s_master)
      ip: nodeg31
      ip: nodec58

on agent
========
2018-09-19 Wed 13:08:05:766.554 IST nodec60 [Thread 140180966799104, Pid 1807]: [SYS_INFO]: EndpointSecurityStats: name = default-global-system-config:nodec60:vhost0 eps= [ [ _iter106->first = 00000000-0000-0000-0000-000000000001 [ workload = client= [ [ [ app = tier = site = deployment = labels = custom_tags = remote_app_id = remote_tier_id = remote_site_id = remote_deployment_id = remote_label_ids = remote_custom_tag_ids = remote_prefix = remote_vn = default-domain:default-project:ip-fabric local_vn = default-domain:default-project:ip-fabric added = 0 deleted = 0 active = 7 in_bytes = 6291 out_bytes = 50500 in_pkts = 81 out_pkts = 94 action = pass ], ] ] server= [ [ [ app = tier = site = deployment = labels = custom_tags = remote_app_id = remote_tier_id = remote_site_id = remote_deployment_id = remote_label_ids = remote_custom_tag_ids = remote_prefix = remote_vn = default-domain:default-project:ip-fabric local_vn = default-domain:default-project:ip-fabric added = 0 deleted = 0 active = 7 in_bytes = 50500 out_bytes = 6291 in_pkts = 94 out_pkts = 81 action = pass ], ] ] ], ] ] file = controller/src/vnsw/agent/uve/interface_uve_stats_table.cc line = 135
2018-09-19 Wed 13:08:07:439.498 IST nodec60 [Thread 140180958402304, Pid 1807]: SANDESH: Sending: LEVEL: [ INVALID ] -> [ SYS_DEBUG ] : 2053
2018-09-19 Wed 13:08:07:439.910 IST nodec60 [Thread 140180958402304, Pid 1807]: SANDESH: Sending: LEVEL: [ SYS_DEBUG ] -> [ INVALID ] : 0
2018-09-19 Wed 13:08:35:767.377 IST nodec60 [Thread 140180970997504, Pid 1807]: [SYS_INFO]: EndpointSecurityStats: name = default-global-system-config:nodec60:vhost0 eps= [ [ _iter106->first = 00000000-0000-0000-0000-000000000001 [ workload = client= [ [ [ app = tier = site = deployment = labels = custom_tags = remote_app_id = remote_tier_id = remote_site_id = remote_deployment_id = remote_label_ids = remote_custom_tag_ids = remote_prefix = remote_vn = default-domain:default-project:ip-fabric local_vn = default-domain:default-project:ip-fabric added = 0 deleted = 0 active = 7 in_bytes = 6446 out_bytes = 46142 in_pkts = 84 out_pkts = 86 action = pass ], ] ] server= [ [ [ app = tier = site = deployment = labels = custom_tags = remote_app_id = remote_tier_id = remote_site_id = remote_deployment_id = remote_label_ids = remote_custom_tag_ids = remote_prefix = remote_vn = default-domain:default-project:ip-fabric local_vn = default-domain:default-project:ip-fabric added = 0 deleted = 0 active = 7 in_bytes = 46142 out_bytes = 6446 in_pkts = 86 out_pkts = 84 action = pass ], ] ] ], ] ] file = controller/src/vnsw/agent/uve/interface_uve_stats_table.cc line = 135
2018-09-19 Wed 13:08:37:440.518 IST nodec60 [Thread 140180954203904, Pid 1807]: SANDESH: Sending: LEVEL: [ INVALID ] -> [ SYS_DEBUG ] : 2185
2018-09-19 Wed 13:08:37:440.917 IST nodec60 [Thread 140180954203904, Pid 1807]: SANDESH: Sending: LEVEL: [ SYS_DEBUG ] -> [ INVALID ] : 0

cni logs
=========
E : 27646 : 2018/09/19 13:10:30 contrail-kube-cni.go:68: Failed processing Add command.
I : 27880 : 2018/09/19 13:10:32 contrail-kube-cni.go:53: Came in Add for container 86d6db19312882197c71999e7fb13f51dd6866ab0310061655ff63e91f90acd5
I : 27880 : 2018/09/19 13:10:32 contrail-kube-cni.go:41: getPodInfo success. container-id 86d6db19312882197c71999e7fb13f51dd6866ab0310061655ff63e91f90acd5 uuid 1ad2a993-bbdf-11e8-88fd-002590c476a0 name test-75c49697d7-lq7bs
I : 27880 : 2018/09/19 13:10:32 cni.go:88: ContainerID : 86d6db19312882197c71999e7fb13f51dd6866ab0310061655ff63e91f90acd5
I : 27880 : 2018/09/19 13:10:32 cni.go:89: NetNS : /proc/27836/ns/net
I : 27880 : 2018/09/19 13:10:32 cni.go:90: Container Ifname : eth0
I : 27880 : 2018/09/19 13:10:32 cni.go:91: Args : IgnoreUnknown=1;K8S_POD_NAMESPACE=default;K8S_POD_NAME=test-75c49697d7-lq7bs;K8S_POD_INFRA_CONTAINER_ID=86d6db19312882197c71999e7fb13f51dd6866ab0310061655ff63e91f90acd5
I : 27880 : 2018/09/19 13:10:32 cni.go:92: CNI VERSION : 0.2.0
I : 27880 : 2018/09/19 13:10:32 cni.go:93: MTU : 1500
I : 27880 : 2018/09/19 13:10:32 cni.go:94: Config File : {"cniVersion":"0.2.0","contrail":{"config-dir":"/var/lib/contrail/ports/vm","log-file":"/var/log/contrail/cni/opencontrail.log","log-level":"4","poll-retries":15,"poll-timeout":5,"vrouter-ip":"127.0.0.1","vrouter-port":9091},"name":"contrail-k8s-cni","type":"contrail-k8s-cni"}
I : 27880 : 2018/09/19 13:10:32 cni.go:95: &{cniArgs:0xc4202ab340 Mode:k8s VifType:veth VifParent:eth0 LogDir:/var/log/contrail/cni LogFile:/var/log/contrail/cni/opencontrail.log LogLevel:4 Mtu:1500 ContainerUuid:1ad2a993-bbdf-11e8-88fd-002590c476a0 ContainerName:test-75c49697d7-lq7bs ContainerVn: VRouter:{Server:127.0.0.1 Port:9091 Dir:/var/lib/contrail/ports/vm PollTimeout:5 PollRetries:15 containerId: containerUuid: containerVn: httpClient:0xc4201a3710}}
I : 27880 : 2018/09/19 13:10:32 vrouter.go:446: {Server:127.0.0.1 Port:9091 Dir:/var/lib/contrail/ports/vm PollTimeout:5 PollRetries:15 containerId: containerUuid: containerVn: httpClient:0xc4201a3710}
I : 27880 : 2018/09/19 13:10:32 vrouter.go:79: VRouter request. Operation : GET Url : http://127.0.0.1:9091/vm-cfg/1ad2a993-bbdf-11e8-88fd-002590c476a0
E : 27880 : 2018/09/19 13:10:32 vrouter.go:147: Failed HTTP Get operation. Return code 404
I : 27880 : 2018/09/19 13:10:32 vrouter.go:181: Iteration 0 : Get vrouter failed
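
For reference, the failing lookup can be reproduced by hand from the affected compute node using only the vrouter endpoint and CNI config-dir that appear in the log above; the commands below are a sketch and are not part of the original report.

# Query the vrouter agent's port-config service directly (the same URL the CNI plugin polls);
# while the problem persists this returns the 404 seen in the CNI log.
curl -s -o /dev/null -w "%{http_code}\n" http://127.0.0.1:9091/vm-cfg/1ad2a993-bbdf-11e8-88fd-002590c476a0

# Check whether a port file was ever written for this pod UUID in the CNI config-dir.
ls /var/lib/contrail/ports/vm/ | grep 1ad2a993-bbdf-11e8-88fd-002590c476a0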

[root@nodec60 contrail]# contrail-status
Pod Service Original Name State Status
vrouter agent contrail-vrouter-agent running Up 19 hours
vrouter nodemgr contrail-nodemgr running Up 19 hours

vrouter kernel module is PRESENT
== Contrail vrouter ==
nodemgr: active
agent: active

[root@nodec60 contrail]

config api
===========
u'request-id': u'req-d432efc2-1912-42dc-93e6-fd78fe859311',
 u'type': u'virtual_machine_interface',
 u'uuid': u'1af781aa-bbdf-11e8-b7ef-002590c55f6a'}
09/19/2018 01:09:13 PM [contrail-api] [DEBUG]: Add uve <default-domain:k8s-default:test-75c49697d7-lq7bs__1af781aa-bbdf-11e8-b7ef-002590c55f6a, 1873> in the [ObjectVMITable:ContrailConfigTrace] map
09/19/2018 01:09:13 PM [contrail-api] [INFO]: __default__ [SYS_INFO]: VncApiLatencyStatsLog: node_name = issu-vm6 api_latency_stats = << operation_type = MULTIGET application = CASSANDRA response_time_in_usec = 6051 response_size = 0 identifier = req-d432efc2-1912-42dc-93e6-fd78fe859311 >>
09/19/2018 01:09:13 PM [contrail-api] [INFO]: __default__ [SYS_INFO]: VncApiConfigLog: api_log = << identifier_uuid = 1af781aa-bbdf-11e8-b7ef-002590c55f6a object_type = virtual_machine_interface identifier_name = default-domain:k8s-default:test-75c49697d7-lq7bs__1af781aa-bbdf-11e8-b7ef-002590c55f6a url = http://127.0.0.1/ref-update operation = ref-update domain = default-domain >>
09/19/2018 01:09:13 PM [contrail-api] [INFO]: __default__ [SYS_INFO]: VncApiLatencyStatsLog: node_name = issu-vm6 api_latency_stats = << operation_type = MULTIGET application = CASSANDRA response_time_in_usec = 1422 response_size = 0 identifier = req-d432efc2-1912-42dc-93e6-fd78fe859311 >>
09/19/2018 01:09:13 PM [contrail-api] [INFO]: __default__ [SYS_INFO]: VncApiLatencyStatsLog: node_name = issu-vm6 api_latency_stats = << operation_type = MULTIGET application = CASSANDRA response_time_in_usec = 1489 response_size = 0 identifier = req-d432efc2-1912-42dc-93e6-fd78fe859311 >>
09/19/2018 01:09:13 PM [contrail-api] [INFO]: __default__ [SYS_INFO]: VncApiLatencyStatsLog: node_name = issu-vm6 api_latency_stats = << operation_type = MULTIGET application = CASSANDRA response_time_in_usec = 899 response_size = 0 identifier = req-d432efc2-1912-42dc-93e6-fd78fe859311 >>
09/19/2018 01:09:13 PM [contrail-api] [INFO]: __default__ [SYS_INFO]: VncApiLatencyStatsLog: node_name = issu-vm6 api_latency_stats = << operation_type = SEND application = CASSANDRA response_time_in_usec = 1413 response_size = 0 identifier = req-d432efc2-1912-42dc-93e6-fd78fe859311 >>
09/19/2018 01:09:13 PM [contrail-api] [INFO]: __default__ [SYS_INFO]: VncApiLatencyStatsLog: node_name = issu-vm6 api_latency_stats = << operation_type = MULTIGET application = CASSANDRA response_time_in_usec = 877 response_size = 0 identifier = req-d432efc2-1912-42dc-93e6-fd78fe859311 >>
09/19/2018 01:09:13 PM [contrail-api] [INFO]: __default__ [SYS_INFO]: VncApiStatsLog: api_stats = << operation_type = POST user = useragent = nodeg31:/usr/bin/contrail-kube-manager remote_ip = 10.204.217.71 domain_name = default-domain project_name = default-project object_type = virtual_machine_interface response_time_in_usec = 42966 response_size = 529 resp_code = 200 >>
09/19/2018 01:09:13 PM [contrail-api] [DEBUG]: __default__ [SYS_DEBUG]: VncApiDebug: Notification Message: {u'fq_name': [u'default-domain',
              u'k8s-default',
              u'test-75c49697d7-lq7bs__1af781aa-bbdf-11e8-b7ef-002590c55f6a'],
 u'oper': u'UPDATE',
 u'request-id': u'req-d432efc2-1912-42dc-93e6-fd78fe859311',
 u'type': u'virtual_machine_interface',
 u'uuid': u'1af781aa-bbdf-11e8-b7ef-002590c55f6a'}
09/19/2018 01:09:13 PM [contrail-api] [INFO]: __default__ [SYS_INFO]: VncApiLatencyStatsLog: node_name = issu-vm6 api_latency_stats = << operation_type = MULTIGET application = CASSANDRA response_time_in_usec = 1117 response_size = 0 identifier = req-d432efc2-1912-42dc-93e6-fd78fe859311 >>
09/19/2018 01:09:14 PM [contrail-api] [DEBUG]: __default__ [SYS_DEBUG]: VncApiDebug: Notification Message: {u'fq_name': [u'default-domain',
              u'k8s-default',
              u'test-75c49697d7-lq7bs__1af781aa-bbdf-11e8-b7ef-002590c55f6a'],
 u'oper': u'UPDATE',
 u'request-id': u'req-c2ac9e16-9f35-4e4f-92c5-4f26372e9022',
 u'type': u'virtual_machine_interface',
 u'uuid': u'1af781aa-bbdf-11e8-b7ef-002590c55f6a'}
09/19/2018 01:09:14 PM [contrail-api] [INFO]: __default__ [SYS_INFO]: VncApiLatencyStatsLog: node_name = issu-vm6 api_latency_stats = << operation_type = MULTIGET application = CASSANDRA response_time_in_usec = 1219 response_size = 0 identifier = req-c2ac9e16-9f35-4e4f-92c5-4f26372e9022 >>
09/19/2018 01:09:14 PM [contrail-api] [INFO]: __default__ [SYS_INFO]: VncApiLatencyStatsLog: node_name = issu-vm6 api_latency_stats = << operation_type = MULTIGET application = CASSANDRA response_time_in_usec = 1144 response_size = 0 identifier = req-d432efc2-1912-42dc-93e6-fd78fe859311 >>
09/19/2018 01:09:14 PM [contrail-api] [INFO]: __default__ [SYS_INFO]: VncApiStatsLog: api_stats = << operation_type = GET user = useragent = nodeg31:/usr/bin/contrail-kube-manager remote_ip = 10.204.217.71 domain_name = default-domain project_name = default-project object_type = virtual_machine_interface response_time_in_usec = 2405 response_size = 2666 resp_code = 200 >>
09/19/2018 01:09:14 PM [contrail-api] [DEBUG]: __default__ [SYS_DEBUG]: VncApiDebug: Notification Message: {u'fq_name': [u'test-75c49697d7-lq7bs__1b02027e-bbdf-11e8-b7ef-002590c55f6a'],
 u'obj_dict': {u'display_name': u'test-75c49697d7-lq7bs__1b02027e-bbdf-11e8-b7ef-002590c55f6a',
               u'fq_name': [u'test-75c49697d7-lq7bs__1b02027e-bbdf-11e8-b7ef-002590c55f6a'],
               u'id_perms': {u'created': u'2018-09-19T07:39:14.035968',
                             u'creator': None,
                             u'description': None,
                             u'enable': True,
                             u'last_modified': u'2018-09-19T07:39:14.035968',
                             u'permissions': {u'group': u'cloud-admin-group',
                                              u'group_access': 7,
                                              u'other_access': 7,
                                              u'owner': u'cloud-admin',
                                              u'owner_access': 7},
                             u'user_visible': True,
                             u'uuid': {u'uuid_lslong': 13253812389717303146L,
                                       u'uuid_mslong': 1946120732318568936}},
               u'instance_ip_address': u'10.47.255.251',
               u'perms2': {u'global_access': 0,
                           u'owner': u'cloud-admin',
                           u'owner_access': 7,
                           u'share': []},
               u'subnet_uuid': u'eff1f49d-cd1b-459b-a5d8-31a54440f83f',
               u'uuid': u'1b02027e-bbdf-11e8-b7ef-002590c55f6a',
               u'virtual_machine_interface_refs': [{u'to': [u'default-domain',
                                                            u'k8s-default',
                                                            u'test-75c49697d7-lq7bs__1af781aa-bbdf-11e8-b7ef-002590c55f6a'],
                                                    u'uuid': u'1af781aa-bbdf-11e8-b7ef-002590c55f6a'}],
               u'virtual_network_refs': [{u'to': [u'default-domain',
                                                  u'k8s-default',
                                                  u'k8s-default-pod-network'],
                                          u'uuid': u'c2ac9d50-27bb-4d18-b6c3-715bc88506a0'}]},

== Contrail control ==
control: active
nodemgr: active
named: active
dns: active

== Contrail config-database ==
nodemgr: initializing (Disk for DB is too low. )
zookeeper: active
rabbitmq: active
cassandra: active

== Contrail kubernetes ==
kube-manager: active

== Contrail database ==
kafka: active
nodemgr: initializing (Disk for DB is too low. )
zookeeper: active
cassandra: active

== Contrail analytics ==
snmp-collector: active
query-engine: active
api: active
alarm-gen: active
nodemgr: active
collector: active
topology: active

== Contrail webui ==
web: active
job: active

== Contrail config ==
svc-monitor: active
nodemgr: active
device-manager: active
api: active
schema: active

Revision history for this message
Venkatesh Velpula (vvelpula) wrote :
description: updated
Revision history for this message
Venkatesh Velpula (vvelpula) wrote :

Attached the contrail-api log from when the config API restart event occurred.

Changed in juniperopenstack:
assignee: Sachchidanand Vaidya (vaidyasd) → Venkatraman Venkatapathy (vvenkatapath)
Revision history for this message
Venkatraman Venkatapathy (vvenkatapath) wrote :

Taking a look. Please keep the cluster in the same state.

Revision history for this message
Venkatraman Venkatapathy (vvenkatapath) wrote :

On debugging, it looks like the config node is unaware of this pod at that moment, so no entry trickles down to the control node and the agent. It could be a case of either a missed RabbitMQ notification or config-node processing after the config API restart.
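
To narrow down which of the two paths is broken, a rough check would be the following. This is not from the original report; 8082 is assumed to be the default contrail-api port, and the commands are only a sketch.

# On a controller: confirm the VMI that kube-manager created is present in the config API.
curl -s http://127.0.0.1:8082/virtual-machine-interface/1af781aa-bbdf-11e8-b7ef-002590c55f6a | python -m json.tool

# Inside the RabbitMQ container: check that the config update notifications are being consumed
# (queues with growing message counts or zero consumers would point at the notification path).
rabbitmqctl list_queues name messages consumers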

Changed in juniperopenstack:
assignee: Venkatraman Venkatapathy (vvenkatapath) → Shivayogi Ugaji (shivayogi123)
Changed in juniperopenstack:
assignee: Shivayogi Ugaji (shivayogi123) → Sathish Holla (sathishholla)
Revision history for this message
Sathish Holla (sathishholla) wrote :

It looks like there is a RabbitMQ cluster partition on the controllers.
The active schema service is running on nodeg31, but the RabbitMQ instance on that node is not part of the cluster that the other two nodes belong to.

root@nodeg12:/# rabbitmqctl cluster_status
Cluster status of node contrail@nodeg12
[{nodes,[{disc,[contrail@nodec58,contrail@nodeg12]}]},
 {running_nodes,[contrail@nodec58,contrail@nodeg12]}, <==== Here, only nodec58 and nodeg12 are part of cluster
 {cluster_name,<<"<email address hidden>">>},
 {partitions,[]},
 {alarms,[{contrail@nodec58,[]},{contrail@nodeg12,[]}]}]
root@nodeg12:/#

root@nodeg31:/# rabbitmqctl cluster_status
Cluster status of node contrail@nodeg31
[{nodes,[{disc,[contrail@nodeg31]}]},
 {running_nodes,[contrail@nodeg31]}, <==== Here, only nodeg31 is part of the cluster.
 {cluster_name,<<"<email address hidden>">>},
 {partitions,[]},
 {alarms,[{contrail@nodeg31,[]}]}]
root@nodeg31:/#

root@nodec58:/# rabbitmqctl cluster_status
Cluster status of node contrail@nodec58
[{nodes,[{disc,[contrail@nodec58,contrail@nodeg12]}]},
 {running_nodes,[contrail@nodeg12,contrail@nodec58]},
 {cluster_name,<<"<email address hidden>">>},
 {partitions,[]},
 {alarms,[{contrail@nodeg12,[]},{contrail@nodec58,[]}]}]
root@nodec58:/#
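
For completeness, cluster membership can be compared across all three controllers in one pass with something like the loop below; this is a sketch, and the "name=rabbitmq" docker filter is an assumption that may need to match the actual contrail-external-rabbitmq container name.

for h in nodeg12 nodeg31 nodec58; do
    echo "== $h =="
    ssh root@$h 'docker exec $(docker ps -qf name=rabbitmq) rabbitmqctl cluster_status | grep -A1 running_nodes'
done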

Revision history for this message
Sathish Holla (sathishholla) wrote :

From the logs, it also looks like the RabbitMQ problem starts at around 09/18/2018 06:10:45 PM, which is before the config_api was restarted at around 09/19/2018 01:09:13 PM.

To recover, the RabbitMQ cluster will need to be brought back to a healthy state on all three controller nodes.

Revision history for this message
Venkatesh Velpula (vvelpula) wrote :

Hey Sathish,

   I am recreating the setup and will let you know.

thanks
-Venky

Changed in juniperopenstack:
assignee: Sathish Holla (sathishholla) → Venkatesh Velpula (vvelpula)
Revision history for this message
Venkatesh Velpula (vvelpula) wrote :

Hi Sathish, the setup is now in the problem state; could you please take a look?

thanks
-Venky

Changed in juniperopenstack:
assignee: Venkatesh Velpula (vvelpula) → Sathish Holla (sathishholla)
Revision history for this message
Sathish Holla (sathishholla) wrote :

Hi Venkatesh,

I don't see config installed on any of nodeg12 (k8s_master), nodeg31, or nodec58.

Please see output of docker ps and contrail-status attached.
Please let me know the new IPs if they have been installed elsewhere.

Thanks,
Sathish

[root@nodeg12 ~]# cat /etc/contrail/common.env
KUBERNETES_API_SERVER=77.77.1.20
VROUTER_GATEWAY=77.77.1.100
TTY=True
LOG_LEVEL=SYS_DEBUG
KUBERNETES_IP_FABRIC_SUBNETS=77.77.1.160/27
CONTAINER_REGISTRY=10.204.217.152:5000
CONTRAIL_VERSION=queens-5.0-275
STDIN_OPEN=True
METADATA_PROXY_SECRET=c0ntrail123
WEBUI_NODES=10.204.217.71,10.204.217.52,10.204.217.98
CONTROLLER_NODES=10.204.217.52,10.204.217.71,10.204.217.98 <==== SATHISH: I checked these 3 nodes.
KUBERNETES_API_NODES=77.77.1.20
REGISTRY_PRIVATE_INSECURE=True
VNC_CURL_LOG_NAME=vnc_logs_k8s.log
RABBITMQ_NODE_PORT=5673
CONTROL_NODES=77.77.1.20,77.77.1.30,77.77.1.11
KUBERNETES_PUBLIC_FIP_POOL={u'project': u'k8s-default', u'domain': u'default-domain', u'name': u'__fip_pool_public__', u'network': u'__public__'}
CLOUD_ORCHESTRATOR=kubernetes
JVM_EXTRA_OPTS=-Xms1g -Xmx2g
[root@nodeg12 ~]#
[root@nodeg12 ~]#
[root@nodeg12 ~]#
[root@nodeg12 ~]#
[root@nodeg12 ~]# ssh root@10.204.217.71
root@10.204.217.71's password:
Last login: Mon Oct 1 22:23:22 2018 from nodeg12.englab.juniper.net
[root@nodeg31 ~]# contrail-status
-bash: contrail-status: command not found
[root@nodeg31 ~]# docker ps <=== SATHISH: No config node container seen here. even with docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
[root@nodeg31 ~]# exit
logout
Connection to 10.204.217.71 closed.

[root@nodeg31 ~]# ssh root@10.204.217.52
root@10.204.217.52's password:
Last login: Mon Oct 1 22:26:08 2018 from nodeg31.englab.juniper.net
[root@nodeg12 ~]# docker ps <=== SATHISH: No config node container seen here. even with docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
12042bf766f5 gcr.io/google_containers/kube-proxy-amd64 "/usr/local/bin/kube…" 2 hours ago Up 2 hours k8s_kube-proxy_kube-proxy-cwkkr_kube-system_a7f6eaa7-c58a-11e8-8b10-002590c476a0_0
fc9749808a52 gcr.io/google_containers/pause-amd64:3.0 "/pause" 2 hours ago Up 2 hours k8s_POD_kube-proxy-cwkkr_kube-system_a7f6eaa7-c58a-11e8-8b10-002590c476a0_0
47c4c9910ceb gcr.io/google_containers/etcd-amd64 "etcd --listen-clien…" 2 hours ago Up 2 hours k8s_etcd_etcd-nodeg12_kube-system_7278f85057e8bf5cb81c9f96d3b25320_0
4c9bb533fa01 gcr.io/google_containers/kube-scheduler-amd64 "kube-scheduler --ad…" 2 hours ago Up 2 hours k8s_kube-scheduler_kube-scheduler-nodeg12_kube-system_69c12074e336b0dbbd0a1666ce05226a_0
f98ce02b794e gcr.io/google_containers/kube-apiserver-amd64 "kube-apiserver --pr…" 2 hours ago Up 2 hours ...


Revision history for this message
Pramodh D'Souza (psdsouza) wrote : FW: Wrong VRF and DHCP lease after re-creating VM / NEC/Juniper Engineering review of Etisalat VeCPE project 2018-0806-0353 / LP786240

This is the email from Raja from earlier when I looked at it; I don't know about the current repro.
In this he clearly mentions “have all interfaces in 65535 VRF”.
Pramodh

From: Pramodh D'Souza <email address hidden>
Date: Wednesday, September 12, 2018 at 4:43 PM
To: Rajakumar David <email address hidden>
Cc: Yuvaraja Mariappan <email address hidden>
Subject: Re: Wrong VRF and DHCP lease after re-creating VM / NEC/Juniper Engineering review of Etisalat VeCPE project 2018-0806-0353 / LP786240

Quick observation:
Noticed these logs in sdnvcpe04cn_Snh_SandeshTraceRequest_Config.log corresponding to c328ec46-2c35-4211-b560-ea5b309cdc8d.

ConfigAddPortEnqueue: op = Add port_uuid = 7a1ccbc9-de44-49fa-9183-179383792538 instance_uuid = c328ec46-2c35-4211-b560-ea5b309cdc8d vn_uuid = feb5da8e-fb2b-412f-a96c-532542be5630 ip_address = 192.168.116.10 system_name = tapsdnvcpe04

ConfigAddPortEnqueue: op = Add port_uuid = 40166ed9-592e-4e90-9622-e8edaeccabf8 instance_uuid = c328ec46-2c35-4211-b560-ea5b309cdc8d vn_uuid = 23c6b87f-b4ec-4b53-a486-305914f44587 ip_address = 10.101.178.134 system_name = tap40166ed9-59 mac_address = 02:40:24:37:aa:16 display_name = VM_IPFE_M_VNFC_Record___0000029683_000973 tx_vlan_id = -1 rx_vlan_id = -1 vm_project_uuid = f6d14e50-ca94-415a-b8df-fbd418db6783 port_type = CfgIntVMPort ip6_address = None file = controller/src/vnsw/agent/port_ipc/port_ipc_handler.cc line = 366

ConfigAddPortEnqueue: op = Add port_uuid = f9644a68-6646-43aa-9b5d-323751bed387 instance_uuid = c328ec46-2c35-4211-b560-ea5b309cdc8d vn_uuid = b12a36da-4383-4d80-801a-ad81baa3929c ip_address = 1.2.3.252 system_name = tapf9644a68-66 mac_address = 02:5e:e9:80:21:ca display_name = VM_IPFE_M_VNFC_Record___0000029683_000973 tx_vlan_id = -1 rx_vlan_id = -1 vm_project_uuid = f6d14e50-ca94-415a-b8df-fbd418db6783 port_type = CfgIntVMPort ip6_address = None file = controller/src/vnsw/agent/port_ipc/port_ipc_handler.cc line = 366

ConfigAddPortEnqueue: op = Add port_uuid = 7a1ccbc9-de44-49fa-9183-179383792538 instance_uuid = c328ec46-2c35-4211-b560-ea5b309cdc8d vn_uuid = feb5da8e-fb2b-412f-a96c-532542be5630 ip_address = 192.168.116.10 system_name = tap7a1ccbc9-de mac_address = 02:39:74:b4:5f:0b display_name = VM_IPFE_M_VNFC_Record___0000029683_000973 tx_vlan_id = -1 rx_vlan_id = -1 vm_project_uuid = f6d14e50-ca94-415a-b8df-fbd418db6783 port_type = CfgIntVMPort ip6_address = None file = controller/src/vnsw/agent/port_ipc/port_ipc_handler.cc line = 366

From: Rajakumar David <email address hidden>
Date: Wednesday, September 12, 2018 at 2:55 AM
To: Anantharamu Suryanarayana <email address hidden>, Ashok Singh R <email address hidden>, Sivakumar Ganapathy <email address hidden>
Cc: contrail-emea <email address hidden>, Contrail Systems Virtual Router Team <email address hidden>, support-private <email address hidden>, Federico Toci <email address hidden>, Alois Zellner <email address hidden>, Nikhil Bansal <email address hidden>, Slobodan Blatnjak <email address hidden>, Mladen Maric <email address hidden>, Assen Tarlov <email address hidden>
Subject: Re: Wrong VRF and DHCP lease after re-creating VM / NEC/Juniper Engineering review of ...

Revision history for this message
Venkatesh Velpula (vvelpula) wrote :

Hey Sathish,
     I am not sure how the setup got reimaged. I have now brought it back into the same state; please see the details below.

[root@nodeg12 ~]# contrail-status
Pod Service Original Name State Status
                 redis contrail-external-redis running Up 23 minutes
analytics alarm-gen contrail-analytics-alarm-gen running Up 23 minutes
analytics api contrail-analytics-api running Up 23 minutes
analytics collector contrail-analytics-collector running Up 23 minutes
analytics nodemgr contrail-nodemgr running Up 23 minutes
analytics query-engine contrail-analytics-query-engine running Up 23 minutes
analytics snmp-collector contrail-analytics-snmp-collector running Up 23 minutes
analytics topology contrail-analytics-topology running Up 23 minutes
config api contrail-controller-config-api running Up 23 minutes
config device-manager contrail-controller-config-devicemgr running Up 23 minutes
config nodemgr contrail-nodemgr running Up 23 minutes
config schema contrail-controller-config-schema running Up 23 minutes
config svc-monitor contrail-controller-config-svcmonitor running Up 23 minutes
config-database cassandra contrail-external-cassandra running Up 23 minutes
config-database nodemgr contrail-nodemgr running Up 23 minutes
config-database rabbitmq contrail-external-rabbitmq running Up 23 minutes
config-database zookeeper contrail-external-zookeeper running Up 23 minutes
control control contrail-controller-control-control running Up 23 minutes
control dns contrail-controller-control-dns running Up 23 minutes
control named contrail-controller-control-named running Up 23 minutes
control nodemgr contrail-nodemgr running Up 23 minutes
database cassandra contrail-external-cassandra running Up 23 minutes
database kafka contrail-external-kafka running Up 23 minutes
database nodemgr contrail-nodemgr running Up 23 minutes
database zookeeper contrail-external-zookeeper running Up 23 minutes
kubernetes kube-manager contrail-kubernetes-kube-manager running Up 23 minutes
webui job contrail-controller-webui-job running Up 23 minutes
webui web contrail-controller-webui-web running Up 23 minutes

WARNING: container with original name 'contrail-external-redis' have Pod or Service empty. Pod: '' / Service: 'redis'. Please pass NODE_TYPE with pod name to container's env

== Contrail control ==
control: active
nodemgr: activ...

Revision history for this message
Venkatesh Velpula (vvelpula) wrote :

reserved the setup too...

[root@nodem4 ~]# cat /cs-shared/testbed_locks/testbed_k8s_multi_intf_ha_sanity_setup.py
vvelpula for debugging config api restart issue
[root@nodem4 ~]#

tags: added: releasenote
Revision history for this message
Sathish Holla (sathishholla) wrote :

This looks like a case of a RabbitMQ race condition during initialization.

As part of the test case, the docker service was restarted on all three controller nodes.
After the docker restart, while the RabbitMQ service is coming back up, there is a race condition between two RabbitMQ nodes and both end up as master nodes.
As a result, there is a RabbitMQ cluster partition.

This is a known RabbitMQ bug, and RabbitMQ proposes the following workaround to handle such cases:
https://github.com/rabbitmq/rabbitmq-server/issues/1202

To implement the above workaround in Contrail, we will need to upgrade the current RabbitMQ version from 3.6 to 3.7.
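
For context, the 3.7-style peer-discovery settings this workaround relies on would look roughly like the rabbitmq.conf fragment below. This is only an illustrative sketch: the node names are taken from this setup, and whether the contrail-external-rabbitmq container exposes these settings is an assumption.

# rabbitmq.conf (RabbitMQ 3.7 sysctl-style config): illustrative sketch only
cluster_formation.peer_discovery_backend = rabbit_peer_discovery_classic_config
cluster_formation.classic_config.nodes.1 = contrail@nodeg12
cluster_formation.classic_config.nodes.2 = contrail@nodeg31
cluster_formation.classic_config.nodes.3 = contrail@nodec58
# stagger simultaneous boots so that two nodes do not both form a fresh cluster
cluster_formation.randomized_startup_delay_range.min = 5
cluster_formation.randomized_startup_delay_range.max = 60
# let the cluster heal automatically if a partition does occur
cluster_partition_handling = autoheal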

Thanks,
Sathish

Revision history for this message
Shivayogi Ugaji (shivayogi123) wrote :

Moving this to 5.0.3 as we do not want to upgrade rabbitmq in 5.0.2.

Hi Venky,

To recover this, we will need to do the following:
1. Log in to the RabbitMQ docker on all the controllers, back up the directory “/var/lib/rabbitmq/mnesia/contrail@<Node_Name>”, and delete it.
2. Once this folder is deleted in all the RabbitMQ dockers, restart the rabbitmq docker on one of the controllers.
3. Wait for about 10 seconds, then restart the rabbitmq docker on the other two controllers.
4. To verify that the rabbitmq cluster is correct, execute the following command and verify that all three nodes are present in the field “running_nodes”:
root@Config2:~/mnesia/contrail@Config2# rabbitmqctl cluster_status
Cluster status of node contrail@Config2
[{nodes,[{disc,[contrail@Config1,contrail@Config2,contrail@Config3]}]},
{running_nodes,[contrail@Config3,contrail@Config1,contrail@Config2]},
{cluster_name,<<"<email address hidden>">>},
{partitions,[]},
{alarms,[{contrail@Config3,[]},{contrail@Config1,[]},{contrail@Config2,[]}]}]

Thanks,
Sathish

From: Venkatesh Velpula <email address hidden>
Date: Tuesday, October 9, 2018 at 9:33 PM
To: Jeba Paulaiyan <email address hidden>, Abhay Joshi <email address hidden>, Shivayogi Ugaji <email address hidden>
Cc: Sathish Holla <email address hidden>, Sudheendra Rao <email address hidden>
Subject: Re: https://bugs.launchpad.net/juniperopenstack/r5.0/+bug/1793269

Hi Jeba,
        This is not happening always, but when it happens the impact is catastrophic.

Sathish,
       Could you please help us with a recovery mechanism? We can release-note the same for 5.0.2.

Thanks
-Venky

From: Jeba Paulaiyan <email address hidden>
Date: Wednesday, October 10, 2018 at 5:33 AM
To: Abhay Joshi <email address hidden>, Shivayogi Ugaji <email address hidden>, Venkatesh Velpula <email address hidden>
Cc: Sathish Holla <email address hidden>, Madhava Rao Sudheendra Rao <email address hidden>
Subject: Re: https://bugs.launchpad.net/juniperopenstack/r5.0/+bug/1793269

Venky,

        This decision is based on the assumption that this is not happening always and that it is a race condition in RabbitMQ. Please feel free to disagree.

Thanks,
Jeba

From: Abhay Joshi <email address hidden>
Date: Tuesday, October 9, 2018 at 16:46
To: Shivayogi Ugaji <email address hidden>
Cc: Sathish Holla <email address hidden>, Jeba Paulaiyan <email address hidden>
Subject: Re: https://bugs.launchpad.net/juniperopenstack/r5.0/+bug/1793269

+ Jeba.

As discussed in bug scrub today, we will push this out to 5.1.0. Please update series accordingly.

Thanks,

Abhay

From: Shivayogi Ugaji <email address hidden>
Date: Tuesday, October 9, 2018 at 1:17 PM
To: Abhay Joshi <email address hidden>
Cc: Sathish Holla <email address hidden>
Subject: https://bugs.launchpad.net/juniperopenstack/r5.0/+bug/1793269

Hi Abhay,

This is due to a bug in the RabbitMQ implementation, and the latest version of RabbitMQ has the fix.
We need to update the RabbitMQ version from 3.6 to 3.7. Any idea who can help with this?

Thanks
Shivayogi

Revision history for this message
Jeba Paulaiyan (jebap) wrote :

Hi Venky,

I noticed that the previous workaround was wrong. Please find the updated instructions below (see the change in step 2).
To recover this, we will need to do the following:
1. Log in to the RabbitMQ docker on all the controllers, back up the directory “/var/lib/rabbitmq/mnesia/contrail@<Node_Name>”, and delete it.
2. Once this folder is deleted in all the RabbitMQ dockers, stop the rabbitmq docker on all the controllers, then start the rabbitmq docker on only one of the controllers.
3. Wait for about 10 seconds, then restart the rabbitmq docker on the other two controllers.
4. To verify that the rabbitmq cluster is correct, execute the following command and verify that all three nodes are present in the field “running_nodes”:
root@Config2:~/mnesia/contrail@Config2# rabbitmqctl cluster_status
Cluster status of node contrail@Config2
[{nodes,[{disc,[contrail@Config1,contrail@Config2,contrail@Config3]}]},
{running_nodes,[contrail@Config3,contrail@Config1,contrail@Config2]},
{cluster_name,<<"<email address hidden>">>},
{partitions,[]},
{alarms,[{contrail@Config3,[]},{contrail@Config1,[]},{contrail@Config2,[]}]}]
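
As a concrete sketch of steps 1 to 4 above, run on each controller as indicated. The "name=rabbitmq" docker filter, the /tmp backup location, and deriving the node name from the short hostname are assumptions; adjust them to the actual contrail-external-rabbitmq container and the node names shown in the cluster_status outputs.

# 1. On every controller: back up and remove the mnesia directory inside the RabbitMQ container.
RMQ=$(docker ps -aqf name=rabbitmq)
docker exec $RMQ sh -c 'cp -a /var/lib/rabbitmq/mnesia/contrail@$(hostname -s) /tmp/ && rm -rf /var/lib/rabbitmq/mnesia/contrail@$(hostname -s)'

# 2. Stop the rabbitmq container on all controllers, then start it on only one of them.
docker stop $RMQ          # run on all three controllers
docker start $RMQ         # run on the first controller only

# 3. Wait about 10 seconds, then start the rabbitmq container on the other two controllers.
sleep 10
docker start $RMQ         # run on the remaining two controllers

# 4. Verify that all three nodes show up under "running_nodes".
docker exec $RMQ rabbitmqctl cluster_status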

Thanks,
Sathish

Revision history for this message
Jeba Paulaiyan (jebap) wrote :

Notes:

k8s pod creation fails after config api restart

To recover this, we will need to do the following:
1. Log in to the RabbitMQ docker on all the controllers, back up the directory “/var/lib/rabbitmq/mnesia/contrail@<Node_Name>”, and delete it.
2. Once this folder is deleted in all the RabbitMQ dockers, stop the rabbitmq docker on all the controllers, then start the rabbitmq docker on only one of the controllers.
3. Wait for about 10 seconds, then restart the rabbitmq docker on the other two controllers.
4. To verify that the rabbitmq cluster is correct, execute the following command and verify that all three nodes are present in the field “running_nodes”:
root@Config2:~/mnesia/contrail@Config2# rabbitmqctl cluster_status
Cluster status of node contrail@Config2
[{nodes,[{disc,[contrail@Config1,contrail@Config2,contrail@Config3]}]},
{running_nodes,[contrail@Config3,contrail@Config1,contrail@Config2]},
{cluster_name,<<"<email address hidden>">>},
{partitions,[]},
{alarms,[{contrail@Config3,[]},{contrail@Config1,[]},{contrail@Config2,[]}]}]
