k8s:pod creation fails after the config api restart

Bug #1793269 reported by Venkatesh Velpula
This bug affects 1 person
Affects: Juniper Openstack (status tracked in Trunk)
  R5.0:  Triaged / High / assigned to Sathish Holla
  Trunk: Triaged / High / assigned to Sathish Holla

Bug Description

After a restart of the config API, pod creation fails. We hit this issue intermittently, but when we do hit it, all the test cases fail after the config API restart.

Right now I have kept the setup intact in the problem state; could you please look at it?

Build: 5.1.0-250
Deployment: Ansible_deployer
Host OS: CentOS 7.5
=======================

Topology
==================
vrouter + k8s_node:

      ip: nodec60
      ip: nodec61

config + control + kubemanager:

      ip: nodeg12(k8s_master)
      ip: nodeg31
      ip: nodec58

on agent
========
2018-09-19 Wed 13:08:05:766.554 IST nodec60 [Thread 140180966799104, Pid 1807]: [SYS_INFO]: EndpointSecurityStats: name = default-global-system-config:nodec60:vhost0 eps= [ [ _iter106->first = 00000000-0000-0000-0000-000000000001 [ workload = client= [ [ [ app = tier = site = deployment = labels = custom_tags = remote_app_id = remote_tier_id = remote_site_id = remote_deployment_id = remote_label_ids = remote_custom_tag_ids = remote_prefix = remote_vn = default-domain:default-project:ip-fabric local_vn = default-domain:default-project:ip-fabric added = 0 deleted = 0 active = 7 in_bytes = 6291 out_bytes = 50500 in_pkts = 81 out_pkts = 94 action = pass ], ] ] server= [ [ [ app = tier = site = deployment = labels = custom_tags = remote_app_id = remote_tier_id = remote_site_id = remote_deployment_id = remote_label_ids = remote_custom_tag_ids = remote_prefix = remote_vn = default-domain:default-project:ip-fabric local_vn = default-domain:default-project:ip-fabric added = 0 deleted = 0 active = 7 in_bytes = 50500 out_bytes = 6291 in_pkts = 94 out_pkts = 81 action = pass ], ] ] ], ] ] file = controller/src/vnsw/agent/uve/interface_uve_stats_table.cc line = 135
2018-09-19 Wed 13:08:07:439.498 IST nodec60 [Thread 140180958402304, Pid 1807]: SANDESH: Sending: LEVEL: [ INVALID ] -> [ SYS_DEBUG ] : 2053
2018-09-19 Wed 13:08:07:439.910 IST nodec60 [Thread 140180958402304, Pid 1807]: SANDESH: Sending: LEVEL: [ SYS_DEBUG ] -> [ INVALID ] : 0
2018-09-19 Wed 13:08:35:767.377 IST nodec60 [Thread 140180970997504, Pid 1807]: [SYS_INFO]: EndpointSecurityStats: name = default-global-system-config:nodec60:vhost0 eps= [ [ _iter106->first = 00000000-0000-0000-0000-000000000001 [ workload = client= [ [ [ app = tier = site = deployment = labels = custom_tags = remote_app_id = remote_tier_id = remote_site_id = remote_deployment_id = remote_label_ids = remote_custom_tag_ids = remote_prefix = remote_vn = default-domain:default-project:ip-fabric local_vn = default-domain:default-project:ip-fabric added = 0 deleted = 0 active = 7 in_bytes = 6446 out_bytes = 46142 in_pkts = 84 out_pkts = 86 action = pass ], ] ] server= [ [ [ app = tier = site = deployment = labels = custom_tags = remote_app_id = remote_tier_id = remote_site_id = remote_deployment_id = remote_label_ids = remote_custom_tag_ids = remote_prefix = remote_vn = default-domain:default-project:ip-fabric local_vn = default-domain:default-project:ip-fabric added = 0 deleted = 0 active = 7 in_bytes = 46142 out_bytes = 6446 in_pkts = 86 out_pkts = 84 action = pass ], ] ] ], ] ] file = controller/src/vnsw/agent/uve/interface_uve_stats_table.cc line = 135
2018-09-19 Wed 13:08:37:440.518 IST nodec60 [Thread 140180954203904, Pid 1807]: SANDESH: Sending: LEVEL: [ INVALID ] -> [ SYS_DEBUG ] : 2185
2018-09-19 Wed 13:08:37:440.917 IST nodec60 [Thread 140180954203904, Pid 1807]: SANDESH: Sending: LEVEL: [ SYS_DEBUG ] -> [ INVALID ] : 0

cni logs
=========
E : 27646 : 2018/09/19 13:10:30 contrail-kube-cni.go:68: Failed processing Add command.
I : 27880 : 2018/09/19 13:10:32 contrail-kube-cni.go:53: Came in Add for container 86d6db19312882197c71999e7fb13f51dd6866ab0310061655ff63e91f90acd5
I : 27880 : 2018/09/19 13:10:32 contrail-kube-cni.go:41: getPodInfo success. container-id 86d6db19312882197c71999e7fb13f51dd6866ab0310061655ff63e91f90acd5 uuid 1ad2a993-bbdf-11e8-88fd-002590c476a0 name test-75c49697d7-lq7bs
I : 27880 : 2018/09/19 13:10:32 cni.go:88: ContainerID : 86d6db19312882197c71999e7fb13f51dd6866ab0310061655ff63e91f90acd5
I : 27880 : 2018/09/19 13:10:32 cni.go:89: NetNS : /proc/27836/ns/net
I : 27880 : 2018/09/19 13:10:32 cni.go:90: Container Ifname : eth0
I : 27880 : 2018/09/19 13:10:32 cni.go:91: Args : IgnoreUnknown=1;K8S_POD_NAMESPACE=default;K8S_POD_NAME=test-75c49697d7-lq7bs;K8S_POD_INFRA_CONTAINER_ID=86d6db19312882197c71999e7fb13f51dd6866ab0310061655ff63e91f90acd5
I : 27880 : 2018/09/19 13:10:32 cni.go:92: CNI VERSION : 0.2.0
I : 27880 : 2018/09/19 13:10:32 cni.go:93: MTU : 1500
I : 27880 : 2018/09/19 13:10:32 cni.go:94: Config File : {"cniVersion":"0.2.0","contrail":{"config-dir":"/var/lib/contrail/ports/vm","log-file":"/var/log/contrail/cni/opencontrail.log","log-level":"4","poll-retries":15,"poll-timeout":5,"vrouter-ip":"127.0.0.1","vrouter-port":9091},"name":"contrail-k8s-cni","type":"contrail-k8s-cni"}
I : 27880 : 2018/09/19 13:10:32 cni.go:95: &{cniArgs:0xc4202ab340 Mode:k8s VifType:veth VifParent:eth0 LogDir:/var/log/contrail/cni LogFile:/var/log/contrail/cni/opencontrail.log LogLevel:4 Mtu:1500 ContainerUuid:1ad2a993-bbdf-11e8-88fd-002590c476a0 ContainerName:test-75c49697d7-lq7bs ContainerVn: VRouter:{Server:127.0.0.1 Port:9091 Dir:/var/lib/contrail/ports/vm PollTimeout:5 PollRetries:15 containerId: containerUuid: containerVn: httpClient:0xc4201a3710}}
I : 27880 : 2018/09/19 13:10:32 vrouter.go:446: {Server:127.0.0.1 Port:9091 Dir:/var/lib/contrail/ports/vm PollTimeout:5 PollRetries:15 containerId: containerUuid: containerVn: httpClient:0xc4201a3710}
I : 27880 : 2018/09/19 13:10:32 vrouter.go:79: VRouter request. Operation : GET Url : http://127.0.0.1:9091/vm-cfg/1ad2a993-bbdf-11e8-88fd-002590c476a0
E : 27880 : 2018/09/19 13:10:32 vrouter.go:147: Failed HTTP Get operation. Return code 404
I : 27880 : 2018/09/19 13:10:32 vrouter.go:181: Iteration 0 : Get vrouter failed
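
For reference, the failing lookup can be reproduced by hand from the affected compute node using only the vrouter endpoint and CNI config-dir that appear in the log above; the commands below are a sketch and are not part of the original report.

# Query the vrouter agent's port-config service directly (the same URL the CNI plugin polls);
# while the problem persists this returns the 404 seen in the CNI log.
curl -s -o /dev/null -w "%{http_code}\n" http://127.0.0.1:9091/vm-cfg/1ad2a993-bbdf-11e8-88fd-002590c476a0

# Check whether a port file was ever written for this pod UUID in the CNI config-dir.
ls /var/lib/contrail/ports/vm/ | grep 1ad2a993-bbdf-11e8-88fd-002590c476a0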

[root@nodec60 contrail]# contrail-status
Pod Service Original Name State Status
vrouter agent contrail-vrouter-agent running Up 19 hours
vrouter nodemgr contrail-nodemgr running Up 19 hours

vrouter kernel module is PRESENT
== Contrail vrouter ==
nodemgr: active
agent: active

[root@nodec60 contrail]

config api
===========
u'request-id': u'req-d432efc2-1912-42dc-93e6-fd78fe859311',
 u'type': u'virtual_machine_interface',
 u'uuid': u'1af781aa-bbdf-11e8-b7ef-002590c55f6a'}
09/19/2018 01:09:13 PM [contrail-api] [DEBUG]: Add uve <default-domain:k8s-default:test-75c49697d7-lq7bs__1af781aa-bbdf-11e8-b7ef-002590c55f6a, 1873> in the [ObjectVMITable:ContrailConfigTrace] map
09/19/2018 01:09:13 PM [contrail-api] [INFO]: __default__ [SYS_INFO]: VncApiLatencyStatsLog: node_name = issu-vm6 api_latency_stats = << operation_type = MULTIGET application = CASSANDRA response_time_in_usec = 6051 response_size = 0 identifier = req-d432efc2-1912-42dc-93e6-fd78fe859311 >>
09/19/2018 01:09:13 PM [contrail-api] [INFO]: __default__ [SYS_INFO]: VncApiConfigLog: api_log = << identifier_uuid = 1af781aa-bbdf-11e8-b7ef-002590c55f6a object_type = virtual_machine_interface identifier_name = default-domain:k8s-default:test-75c49697d7-lq7bs__1af781aa-bbdf-11e8-b7ef-002590c55f6a url = http://127.0.0.1/ref-update operation = ref-update domain = default-domain >>
09/19/2018 01:09:13 PM [contrail-api] [INFO]: __default__ [SYS_INFO]: VncApiLatencyStatsLog: node_name = issu-vm6 api_latency_stats = << operation_type = MULTIGET application = CASSANDRA response_time_in_usec = 1422 response_size = 0 identifier = req-d432efc2-1912-42dc-93e6-fd78fe859311 >>
09/19/2018 01:09:13 PM [contrail-api] [INFO]: __default__ [SYS_INFO]: VncApiLatencyStatsLog: node_name = issu-vm6 api_latency_stats = << operation_type = MULTIGET application = CASSANDRA response_time_in_usec = 1489 response_size = 0 identifier = req-d432efc2-1912-42dc-93e6-fd78fe859311 >>
09/19/2018 01:09:13 PM [contrail-api] [INFO]: __default__ [SYS_INFO]: VncApiLatencyStatsLog: node_name = issu-vm6 api_latency_stats = << operation_type = MULTIGET application = CASSANDRA response_time_in_usec = 899 response_size = 0 identifier = req-d432efc2-1912-42dc-93e6-fd78fe859311 >>
09/19/2018 01:09:13 PM [contrail-api] [INFO]: __default__ [SYS_INFO]: VncApiLatencyStatsLog: node_name = issu-vm6 api_latency_stats = << operation_type = SEND application = CASSANDRA response_time_in_usec = 1413 response_size = 0 identifier = req-d432efc2-1912-42dc-93e6-fd78fe859311 >>
09/19/2018 01:09:13 PM [contrail-api] [INFO]: __default__ [SYS_INFO]: VncApiLatencyStatsLog: node_name = issu-vm6 api_latency_stats = << operation_type = MULTIGET application = CASSANDRA response_time_in_usec = 877 response_size = 0 identifier = req-d432efc2-1912-42dc-93e6-fd78fe859311 >>
09/19/2018 01:09:13 PM [contrail-api] [INFO]: __default__ [SYS_INFO]: VncApiStatsLog: api_stats = << operation_type = POST user = useragent = nodeg31:/usr/bin/contrail-kube-manager remote_ip = 10.204.217.71 domain_name = default-domain project_name = default-project object_type = virtual_machine_interface response_time_in_usec = 42966 response_size = 529 resp_code = 200 >>
09/19/2018 01:09:13 PM [contrail-api] [DEBUG]: __default__ [SYS_DEBUG]: VncApiDebug: Notification Message: {u'fq_name': [u'default-domain',
              u'k8s-default',
              u'test-75c49697d7-lq7bs__1af781aa-bbdf-11e8-b7ef-002590c55f6a'],
 u'oper': u'UPDATE',
 u'request-id': u'req-d432efc2-1912-42dc-93e6-fd78fe859311',
 u'type': u'virtual_machine_interface',
 u'uuid': u'1af781aa-bbdf-11e8-b7ef-002590c55f6a'}
09/19/2018 01:09:13 PM [contrail-api] [INFO]: __default__ [SYS_INFO]: VncApiLatencyStatsLog: node_name = issu-vm6 api_latency_stats = << operation_type = MULTIGET application = CASSANDRA response_time_in_usec = 1117 response_size = 0 identifier = req-d432efc2-1912-42dc-93e6-fd78fe859311 >>
09/19/2018 01:09:14 PM [contrail-api] [DEBUG]: __default__ [SYS_DEBUG]: VncApiDebug: Notification Message: {u'fq_name': [u'default-domain',
              u'k8s-default',
              u'test-75c49697d7-lq7bs__1af781aa-bbdf-11e8-b7ef-002590c55f6a'],
 u'oper': u'UPDATE',
 u'request-id': u'req-c2ac9e16-9f35-4e4f-92c5-4f26372e9022',
 u'type': u'virtual_machine_interface',
 u'uuid': u'1af781aa-bbdf-11e8-b7ef-002590c55f6a'}
09/19/2018 01:09:14 PM [contrail-api] [INFO]: __default__ [SYS_INFO]: VncApiLatencyStatsLog: node_name = issu-vm6 api_latency_stats = << operation_type = MULTIGET application = CASSANDRA response_time_in_usec = 1219 response_size = 0 identifier = req-c2ac9e16-9f35-4e4f-92c5-4f26372e9022 >>
09/19/2018 01:09:14 PM [contrail-api] [INFO]: __default__ [SYS_INFO]: VncApiLatencyStatsLog: node_name = issu-vm6 api_latency_stats = << operation_type = MULTIGET application = CASSANDRA response_time_in_usec = 1144 response_size = 0 identifier = req-d432efc2-1912-42dc-93e6-fd78fe859311 >>
09/19/2018 01:09:14 PM [contrail-api] [INFO]: __default__ [SYS_INFO]: VncApiStatsLog: api_stats = << operation_type = GET user = useragent = nodeg31:/usr/bin/contrail-kube-manager remote_ip = 10.204.217.71 domain_name = default-domain project_name = default-project object_type = virtual_machine_interface response_time_in_usec = 2405 response_size = 2666 resp_code = 200 >>
09/19/2018 01:09:14 PM [contrail-api] [DEBUG]: __default__ [SYS_DEBUG]: VncApiDebug: Notification Message: {u'fq_name': [u'test-75c49697d7-lq7bs__1b02027e-bbdf-11e8-b7ef-002590c55f6a'],
 u'obj_dict': {u'display_name': u'test-75c49697d7-lq7bs__1b02027e-bbdf-11e8-b7ef-002590c55f6a',
               u'fq_name': [u'test-75c49697d7-lq7bs__1b02027e-bbdf-11e8-b7ef-002590c55f6a'],
               u'id_perms': {u'created': u'2018-09-19T07:39:14.035968',
                             u'creator': None,
                             u'description': None,
                             u'enable': True,
                             u'last_modified': u'2018-09-19T07:39:14.035968',
                             u'permissions': {u'group': u'cloud-admin-group',
                                              u'group_access': 7,
                                              u'other_access': 7,
                                              u'owner': u'cloud-admin',
                                              u'owner_access': 7},
                             u'user_visible': True,
                             u'uuid': {u'uuid_lslong': 13253812389717303146L,
                                       u'uuid_mslong': 1946120732318568936}},
               u'instance_ip_address': u'10.47.255.251',
               u'perms2': {u'global_access': 0,
                           u'owner': u'cloud-admin',
                           u'owner_access': 7,
                           u'share': []},
               u'subnet_uuid': u'eff1f49d-cd1b-459b-a5d8-31a54440f83f',
               u'uuid': u'1b02027e-bbdf-11e8-b7ef-002590c55f6a',
               u'virtual_machine_interface_refs': [{u'to': [u'default-domain',
                                                            u'k8s-default',
                                                            u'test-75c49697d7-lq7bs__1af781aa-bbdf-11e8-b7ef-002590c55f6a'],
                                                    u'uuid': u'1af781aa-bbdf-11e8-b7ef-002590c55f6a'}],
               u'virtual_network_refs': [{u'to': [u'default-domain',
                                                  u'k8s-default',
                                                  u'k8s-default-pod-network'],
                                          u'uuid': u'c2ac9d50-27bb-4d18-b6c3-715bc88506a0'}]},

== Contrail control ==
control: active
nodemgr: active
named: active
dns: active

== Contrail config-database ==
nodemgr: initializing (Disk for DB is too low. )
zookeeper: active
rabbitmq: active
cassandra: active

== Contrail kubernetes ==
kube-manager: active

== Contrail database ==
kafka: active
nodemgr: initializing (Disk for DB is too low. )
zookeeper: active
cassandra: active

== Contrail analytics ==
snmp-collector: active
query-engine: active
api: active
alarm-gen: active
nodemgr: active
collector: active
topology: active

== Contrail webui ==
web: active
job: active

== Contrail config ==
svc-monitor: active
nodemgr: active
device-manager: active
api: active
schema: active

Revision history for this message
Venkatesh Velpula (vvelpula) wrote :
description: updated
Revision history for this message
Venkatesh Velpula (vvelpula) wrote :

Attached the contrail-api log from when the config API restart event occurred.

Changed in juniperopenstack:
assignee: Sachchidanand Vaidya (vaidyasd) → Venkatraman Venkatapathy (vvenkatapath)
Revision history for this message
Venkatraman Venkatapathy (vvenkatapath) wrote :

Taking a look. Please keep the cluster in the same state.

Revision history for this message
Venkatraman Venkatapathy (vvenkatapath) wrote :

On debugging, it looks like the config node is unaware of this pod at that moment, so no entry trickles down to the control node and the agent. It could be a case of either a missed RabbitMQ notification or config-node processing after the config API restart.
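
To narrow down which of the two paths is broken, a rough check would be the following. This is not from the original report; 8082 is assumed to be the default contrail-api port, and the commands are only a sketch.

# On a controller: confirm the VMI that kube-manager created is present in the config API.
curl -s http://127.0.0.1:8082/virtual-machine-interface/1af781aa-bbdf-11e8-b7ef-002590c55f6a | python -m json.tool

# Inside the RabbitMQ container: check that the config update notifications are being consumed
# (queues with growing message counts or zero consumers would point at the notification path).
rabbitmqctl list_queues name messages consumers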

Changed in juniperopenstack:
assignee: Venkatraman Venkatapathy (vvenkatapath) → Shivayogi Ugaji (shivayogi123)
Changed in juniperopenstack:
assignee: Shivayogi Ugaji (shivayogi123) → Sathish Holla (sathishholla)
Revision history for this message
Sathish Holla (sathishholla) wrote :

It looks like there is a RabbitMQ cluster partition on the controllers.
The active schema service is running on nodeg31, but the RabbitMQ instance on that node is not part of the cluster that the other two nodes belong to.

root@nodeg12:/# rabbitmqctl cluster_status
Cluster status of node contrail@nodeg12
[{nodes,[{disc,[contrail@nodec58,contrail@nodeg12]}]},
 {running_nodes,[contrail@nodec58,contrail@nodeg12]}, <==== Here, only nodec58 and nodeg12 are part of cluster
 {cluster_name,<<"<email address hidden>">>},
 {partitions,[]},
 {alarms,[{contrail@nodec58,[]},{contrail@nodeg12,[]}]}]
root@nodeg12:/#

root@nodeg31:/# rabbitmqctl cluster_status
Cluster status of node contrail@nodeg31
[{nodes,[{disc,[contrail@nodeg31]}]},
 {running_nodes,[contrail@nodeg31]}, <==== Here, only nodeg31 is part of the cluster.
 {cluster_name,<<"<email address hidden>">>},
 {partitions,[]},
 {alarms,[{contrail@nodeg31,[]}]}]
root@nodeg31:/#

root@nodec58:/# rabbitmqctl cluster_status
Cluster status of node contrail@nodec58
[{nodes,[{disc,[contrail@nodec58,contrail@nodeg12]}]},
 {running_nodes,[contrail@nodeg12,contrail@nodec58]},
 {cluster_name,<<"<email address hidden>">>},
 {partitions,[]},
 {alarms,[{contrail@nodeg12,[]},{contrail@nodec58,[]}]}]
root@nodec58:/#
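
For completeness, cluster membership can be compared across all three controllers in one pass with something like the loop below; this is a sketch, and the "name=rabbitmq" docker filter is an assumption that may need to match the actual contrail-external-rabbitmq container name.

for h in nodeg12 nodeg31 nodec58; do
    echo "== $h =="
    ssh root@$h 'docker exec $(docker ps -qf name=rabbitmq) rabbitmqctl cluster_status | grep -A1 running_nodes'
done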

Revision history for this message
Sathish Holla (sathishholla) wrote :

From the logs, it also looks like the RabbitMQ problem starts at around 09/18/2018 06:10:45 PM, which is before the config_api was restarted at around 09/19/2018 01:09:13 PM.

To recover, the RabbitMQ cluster will need to be brought back to a healthy state on all three controller nodes.

Revision history for this message
Venkatesh Velpula (vvelpula) wrote :

Hey Sathish,

   I am recreating the setup and will let you know.

thanks
-Venky

Changed in juniperopenstack:
assignee: Sathish Holla (sathishholla) → Venkatesh Velpula (vvelpula)
Revision history for this message
Venkatesh Velpula (vvelpula) wrote :

Hi Sathish, the setup is now in the problem state; could you please take a look?

thanks
-Venky

Changed in juniperopenstack:
assignee: Venkatesh Velpula (vvelpula) → Sathish Holla (sathishholla)
Revision history for this message
Sathish Holla (sathishholla) wrote :

Hi Venkatesh,

I don't see config installed on any of nodeg12 (k8s_master), nodeg31, or nodec58.

Please see output of docker ps and contrail-status attached.
Please let me know the new IPs if they have been installed elsewhere.

Thanks,
Sathish

[root@nodeg12 ~]# cat /etc/contrail/common.env
KUBERNETES_API_SERVER=77.77.1.20
VROUTER_GATEWAY=77.77.1.100
TTY=True
LOG_LEVEL=SYS_DEBUG
KUBERNETES_IP_FABRIC_SUBNETS=77.77.1.160/27
CONTAINER_REGISTRY=10.204.217.152:5000
CONTRAIL_VERSION=queens-5.0-275
STDIN_OPEN=True
METADATA_PROXY_SECRET=c0ntrail123
WEBUI_NODES=10.204.217.71,10.204.217.52,10.204.217.98
CONTROLLER_NODES=10.204.217.52,10.204.217.71,10.204.217.98 <==== SATHISH: I checked these 3 nodes.
KUBERNETES_API_NODES=77.77.1.20
REGISTRY_PRIVATE_INSECURE=True
VNC_CURL_LOG_NAME=vnc_logs_k8s.log
RABBITMQ_NODE_PORT=5673
CONTROL_NODES=77.77.1.20,77.77.1.30,77.77.1.11
KUBERNETES_PUBLIC_FIP_POOL={u'project': u'k8s-default', u'domain': u'default-domain', u'name': u'__fip_pool_public__', u'network': u'__public__'}
CLOUD_ORCHESTRATOR=kubernetes
JVM_EXTRA_OPTS=-Xms1g -Xmx2g
[root@nodeg12 ~]#
[root@nodeg12 ~]#
[root@nodeg12 ~]#
[root@nodeg12 ~]#
[root@nodeg12 ~]# ssh root@10.204.217.71
root@10.204.217.71's password:
Last login: Mon Oct 1 22:23:22 2018 from nodeg12.englab.juniper.net
[root@nodeg31 ~]# contrail-status
-bash: contrail-status: command not found
[root@nodeg31 ~]# docker ps <=== SATHISH: No config node container seen here. even with docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
[root@nodeg31 ~]# exit
logout
Connection to 10.204.217.71 closed.

[root@nodeg31 ~]# ssh root@10.204.217.52
root@10.204.217.52's password:
Last login: Mon Oct 1 22:26:08 2018 from nodeg31.englab.juniper.net
[root@nodeg12 ~]# docker ps <=== SATHISH: No config node container seen here. even with docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
12042bf766f5 gcr.io/google_containers/kube-proxy-amd64 "/usr/local/bin/kube…" 2 hours ago Up 2 hours k8s_kube-proxy_kube-proxy-cwkkr_kube-system_a7f6eaa7-c58a-11e8-8b10-002590c476a0_0
fc9749808a52 gcr.io/google_containers/pause-amd64:3.0 "/pause" 2 hours ago Up 2 hours k8s_POD_kube-proxy-cwkkr_kube-system_a7f6eaa7-c58a-11e8-8b10-002590c476a0_0
47c4c9910ceb gcr.io/google_containers/etcd-amd64 "etcd --listen-clien…" 2 hours ago Up 2 hours k8s_etcd_etcd-nodeg12_kube-system_7278f85057e8bf5cb81c9f96d3b25320_0
4c9bb533fa01 gcr.io/google_containers/kube-scheduler-amd64 "kube-scheduler --ad…" 2 hours ago Up 2 hours k8s_kube-scheduler_kube-scheduler-nodeg12_kube-system_69c12074e336b0dbbd0a1666ce05226a_0
f98ce02b794e gcr.io/google_containers/kube-apiserver-amd64 "kube-apiserver --pr…" 2 hours ago Up 2 hours ...


Revision history for this message
Pramodh D'Souza (psdsouza) wrote : FW: Wrong VRF and DHCP lease after re-creating VM / NEC/Juniper Engineering review of Etisalat VeCPE project 2018-0806-0353 / LP786240

This is the email from Raja from earlier when I looked at it; I don't know about the current repro.
In this he clearly mentions “have all interfaces in 65535 VRF”.
Pramodh

From: Pramodh D'Souza <email address hidden>
Date: Wednesday, September 12, 2018 at 4:43 PM
To: Rajakumar David <email address hidden>
Cc: Yuvaraja Mariappan <email address hidden>
Subject: Re: Wrong VRF and DHCP lease after re-creating VM / NEC/Juniper Engineering review of Etisalat VeCPE project 2018-0806-0353 / LP786240

Quick observation:
Noticed these logs in sdnvcpe04cn_Snh_SandeshTraceRequest_Config.log corresponding to c328ec46-2c35-4211-b560-ea5b309cdc8d.

ConfigAddPortEnqueue: op = Add port_uuid = 7a1ccbc9-de44-49fa-9183-179383792538 instance_uuid = c328ec46-2c35-4211-b560-ea5b309cdc8d vn_uuid = feb5da8e-fb2b-412f-a96c-532542be5630 ip_address = 192.168.116.10 system_name = tapsdnvcpe04

ConfigAddPortEnqueue: op = Add port_uuid = 40166ed9-592e-4e90-9622-e8edaeccabf8 instance_uuid = c328ec46-2c35-4211-b560-ea5b309cdc8d vn_uuid = 23c6b87f-b4ec-4b53-a486-305914f44587 ip_address = 10.101.178.134 system_name = tap40166ed9-59 mac_address = 02:40:24:37:aa:16 display_name = VM_IPFE_M_VNFC_Record___0000029683_000973 tx_vlan_id = -1 rx_vlan_id = -1 vm_project_uuid = f6d14e50-ca94-415a-b8df-fbd418db6783 port_type = CfgIntVMPort ip6_address = None file = controller/src/vnsw/agent/port_ipc/port_ipc_handler.cc line = 366

ConfigAddPortEnqueue: op = Add port_uuid = f9644a68-6646-43aa-9b5d-323751bed387 instance_uuid = c328ec46-2c35-4211-b560-ea5b309cdc8d vn_uuid = b12a36da-4383-4d80-801a-ad81baa3929c ip_address = 1.2.3.252 system_name = tapf9644a68-66 mac_address = 02:5e:e9:80:21:ca display_name = VM_IPFE_M_VNFC_Record___0000029683_000973 tx_vlan_id = -1 rx_vlan_id = -1 vm_project_uuid = f6d14e50-ca94-415a-b8df-fbd418db6783 port_type = CfgIntVMPort ip6_address = None file = controller/src/vnsw/agent/port_ipc/port_ipc_handler.cc line = 366

ConfigAddPortEnqueue: op = Add port_uuid = 7a1ccbc9-de44-49fa-9183-179383792538 instance_uuid = c328ec46-2c35-4211-b560-ea5b309cdc8d vn_uuid = feb5da8e-fb2b-412f-a96c-532542be5630 ip_address = 192.168.116.10 system_name = tap7a1ccbc9-de mac_address = 02:39:74:b4:5f:0b display_name = VM_IPFE_M_VNFC_Record___0000029683_000973 tx_vlan_id = -1 rx_vlan_id = -1 vm_project_uuid = f6d14e50-ca94-415a-b8df-fbd418db6783 port_type = CfgIntVMPort ip6_address = None file = controller/src/vnsw/agent/port_ipc/port_ipc_handler.cc line = 366

From: Rajakumar David <email address hidden>
Date: Wednesday, September 12, 2018 at 2:55 AM
To: Anantharamu Suryanarayana <email address hidden>, Ashok Singh R <email address hidden>, Sivakumar Ganapathy <email address hidden>
Cc: contrail-emea <email address hidden>, Contrail Systems Virtual Router Team <email address hidden>, support-private <email address hidden>, Federico Toci <email address hidden>, Alois Zellner <email address hidden>, Nikhil Bansal <email address hidden>, Slobodan Blatnjak <email address hidden>, Mladen Maric <email address hidden>, Assen Tarlov <email address hidden>
Subject: Re: Wrong VRF and DHCP lease after re-creating VM / NEC/Juniper Engineering review of ...

Revision history for this message
Venkatesh Velpula (vvelpula) wrote :

Hey Sathish,
     I am not sure how the setup got reimaged. I have now brought it back into the same state; please see the details below.

[root@nodeg12 ~]# contrail-status
Pod Service Original Name State Status
                 redis contrail-external-redis running Up 23 minutes
analytics alarm-gen contrail-analytics-alarm-gen running Up 23 minutes
analytics api contrail-analytics-api running Up 23 minutes
analytics collector contrail-analytics-collector running Up 23 minutes
analytics nodemgr contrail-nodemgr running Up 23 minutes
analytics query-engine contrail-analytics-query-engine running Up 23 minutes
analytics snmp-collector contrail-analytics-snmp-collector running Up 23 minutes
analytics topology contrail-analytics-topology running Up 23 minutes
config api contrail-controller-config-api running Up 23 minutes
config device-manager contrail-controller-config-devicemgr running Up 23 minutes
config nodemgr contrail-nodemgr running Up 23 minutes
config schema contrail-controller-config-schema running Up 23 minutes
config svc-monitor contrail-controller-config-svcmonitor running Up 23 minutes
config-database cassandra contrail-external-cassandra running Up 23 minutes
config-database nodemgr contrail-nodemgr running Up 23 minutes
config-database rabbitmq contrail-external-rabbitmq running Up 23 minutes
config-database zookeeper contrail-external-zookeeper running Up 23 minutes
control control contrail-controller-control-control running Up 23 minutes
control dns contrail-controller-control-dns running Up 23 minutes
control named contrail-controller-control-named running Up 23 minutes
control nodemgr contrail-nodemgr running Up 23 minutes
database cassandra contrail-external-cassandra running Up 23 minutes
database kafka contrail-external-kafka running Up 23 minutes
database nodemgr contrail-nodemgr running Up 23 minutes
database zookeeper contrail-external-zookeeper running Up 23 minutes
kubernetes kube-manager contrail-kubernetes-kube-manager running Up 23 minutes
webui job contrail-controller-webui-job running Up 23 minutes
webui web contrail-controller-webui-web running Up 23 minutes

WARNING: container with original name 'contrail-external-redis' have Pod or Service empty. Pod: '' / Service: 'redis'. Please pass NODE_TYPE with pod name to container's env

== Contrail control ==
control: active
nodemgr: activ...

Revision history for this message
Venkatesh Velpula (vvelpula) wrote :

reserved the setup too...

[root@nodem4 ~]# cat /cs-shared/testbed_locks/testbed_k8s_multi_intf_ha_sanity_setup.py
vvelpula for debugging config api restart issue
[root@nodem4 ~]#

tags: added: releasenote
Revision history for this message
Sathish Holla (sathishholla) wrote :

This looks like a case of a RabbitMQ race condition during initialization.

As part of the test case, the docker service was restarted on all three controller nodes.
After the docker restart, while the RabbitMQ service is coming back up, there is a race condition between two RabbitMQ nodes and both end up as master nodes.
As a result, there is a RabbitMQ cluster partition.

This is a known RabbitMQ bug, and RabbitMQ proposes the following workaround to handle such cases:
https://github.com/rabbitmq/rabbitmq-server/issues/1202

To implement the above workaround in Contrail, we will need to upgrade the current RabbitMQ version from 3.6 to 3.7.
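
For context, the 3.7-style peer-discovery settings this workaround relies on would look roughly like the rabbitmq.conf fragment below. This is only an illustrative sketch: the node names are taken from this setup, and whether the contrail-external-rabbitmq container exposes these settings is an assumption.

# rabbitmq.conf (RabbitMQ 3.7 sysctl-style config): illustrative sketch only
cluster_formation.peer_discovery_backend = rabbit_peer_discovery_classic_config
cluster_formation.classic_config.nodes.1 = contrail@nodeg12
cluster_formation.classic_config.nodes.2 = contrail@nodeg31
cluster_formation.classic_config.nodes.3 = contrail@nodec58
# stagger simultaneous boots so that two nodes do not both form a fresh cluster
cluster_formation.randomized_startup_delay_range.min = 5
cluster_formation.randomized_startup_delay_range.max = 60
# let the cluster heal automatically if a partition does occur
cluster_partition_handling = autoheal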

Thanks,
Sathish

Revision history for this message
Shivayogi Ugaji (shivayogi123) wrote :

Moving this to 5.0.3 as we do not want to upgrade rabbitmq in 5.0.2.

Hi Venky,

To recover this, we will need to do the following:
1. Log in to the RabbitMQ docker on all the controllers, back up the directory “/var/lib/rabbitmq/mnesia/contrail@<Node_Name>”, and delete it.
2. Once this folder is deleted in all the RabbitMQ dockers, restart the rabbitmq docker on one of the controllers.
3. Wait for about 10 seconds, then restart the rabbitmq docker on the other two controllers.
4. To verify that the rabbitmq cluster is correct, execute the following command and verify that all three nodes are present in the field “running_nodes”:
root@Config2:~/mnesia/contrail@Config2# rabbitmqctl cluster_status
Cluster status of node contrail@Config2
[{nodes,[{disc,[contrail@Config1,contrail@Config2,contrail@Config3]}]},
{running_nodes,[contrail@Config3,contrail@Config1,contrail@Config2]},
{cluster_name,<<"<email address hidden>">>},
{partitions,[]},
{alarms,[{contrail@Config3,[]},{contrail@Config1,[]},{contrail@Config2,[]}]}]

Thanks,
Sathish

From: Venkatesh Velpula <email address hidden>
Date: Tuesday, October 9, 2018 at 9:33 PM
To: Jeba Paulaiyan <email address hidden>, Abhay Joshi <email address hidden>, Shivayogi Ugaji <email address hidden>
Cc: Sathish Holla <email address hidden>, Sudheendra Rao <email address hidden>
Subject: Re: https://bugs.launchpad.net/juniperopenstack/r5.0/+bug/1793269

Hi Jeba,
        This is not happening always, but when it happens the impact is catastrophic.

Sathish,
       Could you please help us with a recovery mechanism? We can release-note the same for 5.0.2.

Thanks
-Venky

From: Jeba Paulaiyan <email address hidden>
Date: Wednesday, October 10, 2018 at 5:33 AM
To: Abhay Joshi <email address hidden>, Shivayogi Ugaji <email address hidden>, Venkatesh Velpula <email address hidden>
Cc: Sathish Holla <email address hidden>, Madhava Rao Sudheendra Rao <email address hidden>
Subject: Re: https://bugs.launchpad.net/juniperopenstack/r5.0/+bug/1793269

Venky,

        This decision is based on the assumption that this is not happening always and that it is a race condition in RabbitMQ. Please feel free to disagree.

Thanks,
Jeba

From: Abhay Joshi <email address hidden>
Date: Tuesday, October 9, 2018 at 16:46
To: Shivayogi Ugaji <email address hidden>
Cc: Sathish Holla <email address hidden>, Jeba Paulaiyan <email address hidden>
Subject: Re: https://bugs.launchpad.net/juniperopenstack/r5.0/+bug/1793269

+ Jeba.

As discussed in bug scrub today, we will push this out to 5.1.0. Please update series accordingly.

Thanks,

Abhay

From: Shivayogi Ugaji <email address hidden>
Date: Tuesday, October 9, 2018 at 1:17 PM
To: Abhay Joshi <email address hidden>
Cc: Sathish Holla <email address hidden>
Subject: https://bugs.launchpad.net/juniperopenstack/r5.0/+bug/1793269

Hi Abhay,

This is due to a bug in the RabbitMQ implementation, and the latest version of RabbitMQ has the fix.
We need to update the RabbitMQ version from 3.6 to 3.7. Any idea who can help with this?

Thanks
Shivayogi

Revision history for this message
Jeba Paulaiyan (jebap) wrote :

Hi Venky,

I noticed that the previous workaround was wrong. Please find the updated instructions below (see the change in step 2).
To recover this, we will need to do the following:
1. Log in to the RabbitMQ docker on all the controllers, back up the directory “/var/lib/rabbitmq/mnesia/contrail@<Node_Name>”, and delete it.
2. Once this folder is deleted in all the RabbitMQ dockers, stop the rabbitmq docker on all the controllers, then start the rabbitmq docker on only one of the controllers.
3. Wait for about 10 seconds, then restart the rabbitmq docker on the other two controllers.
4. To verify that the rabbitmq cluster is correct, execute the following command and verify that all three nodes are present in the field “running_nodes”:
root@Config2:~/mnesia/contrail@Config2# rabbitmqctl cluster_status
Cluster status of node contrail@Config2
[{nodes,[{disc,[contrail@Config1,contrail@Config2,contrail@Config3]}]},
{running_nodes,[contrail@Config3,contrail@Config1,contrail@Config2]},
{cluster_name,<<"<email address hidden>">>},
{partitions,[]},
{alarms,[{contrail@Config3,[]},{contrail@Config1,[]},{contrail@Config2,[]}]}]
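
As a concrete sketch of steps 1 to 4 above, run on each controller as indicated. The "name=rabbitmq" docker filter, the /tmp backup location, and deriving the node name from the short hostname are assumptions; adjust them to the actual contrail-external-rabbitmq container and the node names shown in the cluster_status outputs.

# 1. On every controller: back up and remove the mnesia directory inside the RabbitMQ container.
RMQ=$(docker ps -aqf name=rabbitmq)
docker exec $RMQ sh -c 'cp -a /var/lib/rabbitmq/mnesia/contrail@$(hostname -s) /tmp/ && rm -rf /var/lib/rabbitmq/mnesia/contrail@$(hostname -s)'

# 2. Stop the rabbitmq container on all controllers, then start it on only one of them.
docker stop $RMQ          # run on all three controllers
docker start $RMQ         # run on the first controller only

# 3. Wait about 10 seconds, then start the rabbitmq container on the other two controllers.
sleep 10
docker start $RMQ         # run on the remaining two controllers

# 4. Verify that all three nodes show up under "running_nodes".
docker exec $RMQ rabbitmqctl cluster_status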

Thanks,
Sathish

Revision history for this message
Jeba Paulaiyan (jebap) wrote :

Notes:

k8s pod creation fails after config api restart

To recover this, we will need to do the following:
1. Log in to the RabbitMQ docker on all the controllers, back up the directory “/var/lib/rabbitmq/mnesia/contrail@<Node_Name>”, and delete it.
2. Once this folder is deleted in all the RabbitMQ dockers, stop the rabbitmq docker on all the controllers, then start the rabbitmq docker on only one of the controllers.
3. Wait for about 10 seconds, then restart the rabbitmq docker on the other two controllers.
4. To verify that the rabbitmq cluster is correct, execute the following command and verify that all three nodes are present in the field “running_nodes”:
root@Config2:~/mnesia/contrail@Config2# rabbitmqctl cluster_status
Cluster status of node contrail@Config2
[{nodes,[{disc,[contrail@Config1,contrail@Config2,contrail@Config3]}]},
{running_nodes,[contrail@Config3,contrail@Config1,contrail@Config2]},
{cluster_name,<<"<email address hidden>">>},
{partitions,[]},
{alarms,[{contrail@Config3,[]},{contrail@Config1,[]},{contrail@Config2,[]}]}]
