[k8s-prov-R5.0-30]: K8s provisioning using contrail-test-ansible failed on one of the setups

Bug #1766155 reported by Pulkit Tandon on 2018-04-23
This bug affects 1 person
Project: Juniper Openstack (status tracked in Trunk)

| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| R5.0 | Fix Released | High | Pragash Vijayaragavan | |
| Trunk | Fix Released | High | Pragash Vijayaragavan | |

Bug Description

R5.0-30
Multi-interface virtual setup.
3 control nodes, 3 agents.

Provisioning reported no failures, but the cluster did not come up correctly.

Instances.yaml:
```
global_configuration:
   REGISTRY_PRIVATE_INSECURE: True
   CONTAINER_REGISTRY: ci-repo.englab.juniper.net:5000
provider_config:
  bms:
    domainsuffix: local
    ntpserver: 10.84.5.100
    ssh_pwd: c0ntrail123
    ssh_user: root

instances:
  server1:
      ip: 10.0.0.4
      provider: bms
      roles:
          analytics: null
          analytics_database: null
          config: null
          config_database: null
          control: null
          kubemanager: null
          webui: null
  server2:
      ip: 10.0.0.5
      provider: bms
      roles:
          analytics: null
          analytics_database: null
          config: null
          config_database: null
          control: null
          kubemanager: null
          webui: null
  server3:
      ip: 10.0.0.6
      provider: bms
      roles:
          analytics: null
          analytics_database: null
          config: null
          config_database: null
          control: null
          k8s_master: null
          kubemanager: null
          webui: null
  server4:
      ip: 10.0.0.7
      provider: bms
      roles:
          k8s_node: null
          vrouter: null
  server5:
      ip: 10.0.0.8
      provider: bms
      roles:
          k8s_node: null
          vrouter: null
  server6:
      ip: 10.0.0.9
      provider: bms
      roles:
          k8s_node: null
          vrouter: null

contrail_configuration:
  CONTRAIL_VERSION: ocata-5.0-30
  CLOUD_ORCHESTRATOR: kubernetes
  METADATA_PROXY_SECRET: c0ntrail123
  CONTROLLER_NODES: 10.10.0.4,10.10.0.5,10.10.0.6
  AAA_MODE: no-auth
  CONTROLLER_NODES: 10.10.0.4,10.10.0.5,10.10.0.6
  CONTROL_DATA_NET_LIST: 10.10.0.0/24
  PHYSICAL_INTERFACE: eth1
  VROUTER_GATEWAY: 10.10.0.1
  two_interface: true

```
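
Side note on the file above: `CONTROLLER_NODES` appears twice under `contrail_configuration`. Both values are identical, so this is unlikely to be related to the failure, but many YAML loaders silently keep only the last value for a repeated key. A minimal sketch of a duplicate-key check follows. The helper name `duplicate_keys` is hypothetical, and it only handles simple indented `key: value` mappings like the ones in this file; it is not a general YAML parser.

```python
def duplicate_keys(yaml_text):
    """Return (dotted_path, line_no) for keys repeated within the same mapping.

    Assumption: the input uses only plain indented `key: value` block
    mappings (no lists, flow style, or multi-line scalars).
    """
    stack = []    # (indent, key) pairs for the ancestors of the current line
    seen = set()  # full key paths encountered so far
    dups = []
    for line_no, raw in enumerate(yaml_text.splitlines(), start=1):
        stripped = raw.strip()
        if not stripped or stripped.startswith("#") or ":" not in stripped:
            continue
        indent = len(raw) - len(raw.lstrip(" "))
        key = stripped.split(":", 1)[0].strip()
        # Drop ancestors that are at the same or a deeper indentation level.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        path = tuple(k for _, k in stack) + (key,)
        if path in seen:
            dups.append((".".join(path), line_no))
        seen.add(path)
        stack.append((indent, key))
    return dups


sample = """\
contrail_configuration:
  CONTROLLER_NODES: 10.10.0.4,10.10.0.5,10.10.0.6
  AAA_MODE: no-auth
  CONTROLLER_NODES: 10.10.0.4,10.10.0.5,10.10.0.6
"""
print(duplicate_keys(sample))
```

Line numbers in the result are relative to the text passed in, so run it against the whole file to get positions that match an editor.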

Cluster status on all 3 Control nodes:
Server1:
```
== Contrail control ==
control: initializing (No BGP configuration for self)
nodemgr: initializing
named: active
dns: active

== Contrail analytics ==
snmp-collector: active
query-engine: active
api: initializing (Redis-UVE:10.10.0.4:6379[None], Redis-UVE:10.10.0.5:6379[None] connection down)
alarm-gen: initializing (Redis-UVE:10.10.0.4:6379[None], Redis-UVE:10.10.0.5:6379[None] connection down)
nodemgr: initializing
collector: initializing
topology: active

== Contrail config ==
api: active
zookeeper: active
svc-monitor: backup
nodemgr: initializing
device-manager: initializing (ApiServer:ApiServer[] connection down)
cassandra: active
rabbitmq: active
schema: backup

== Contrail webui ==
web: active
job: active

== Contrail database ==
kafka: active
nodemgr: initializing
zookeeper: active
cassandra: active

```

Server2:
```
== Contrail control ==
control: initializing (No BGP configuration for self)
nodemgr: initializing
named: active
dns: active

== Contrail analytics ==
snmp-collector: active
query-engine: active
api: initializing (Redis-UVE:10.10.0.4:6379[None], Redis-UVE:10.10.0.5:6379[None] connection down)
alarm-gen: initializing (Redis-UVE:10.10.0.4:6379[None], Redis-UVE:10.10.0.5:6379[None] connection down)
nodemgr: initializing
collector: initializing
topology: active

== Contrail config ==
api: active
zookeeper: active
svc-monitor: backup
nodemgr: initializing
device-manager: backup
cassandra: active
rabbitmq: active
schema: initializing (ApiServer:ApiServer[] connection down)

== Contrail webui ==
web: active
job: active

== Contrail database ==
kafka: active
nodemgr: initializing
zookeeper: active
cassandra: active

```

Server3:
```
== Contrail control ==
control: initializing (No BGP configuration for self)
nodemgr: initializing
named: active
dns: active

== Contrail kubernetes ==
kube-manager: initializing (ApiServer:ApiServer[] connection down)

== Contrail database ==
kafka: active
nodemgr: initializing
zookeeper: active
cassandra: active

== Contrail analytics ==
snmp-collector: active
query-engine: active
api: initializing (Redis-UVE:10.10.0.4:6379[None], Redis-UVE:10.10.0.5:6379[None] connection down)
alarm-gen: initializing (Redis-UVE:10.10.0.4:6379[None], Redis-UVE:10.10.0.5:6379[None] connection down)
nodemgr: initializing
collector: initializing (Database:server3:Global connection down)
topology: active

== Contrail webui ==
web: active
job: active

== Contrail config ==
api: active
zookeeper: active
svc-monitor: backup
nodemgr: initializing
device-manager: backup
cassandra: active
rabbitmq: active
schema: backup

```
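
To compare the three status dumps above quickly, the output can be reduced to the services that are not healthy. This is a rough sketch, not part of any Contrail tooling: the function name `unhealthy_services` is made up for illustration, and treating `backup` as healthy is an assumption based on the active/backup roles shown for svc-monitor, schema and device-manager.

```python
import re

SECTION_RE = re.compile(r"^== Contrail (?P<section>.+?) ==$")
SERVICE_RE = re.compile(
    r"^(?P<name>[\w-]+): (?P<state>\w+)(?: \((?P<detail>.*)\))?$"
)

# Assumption: "backup" is a normal state for the standby instance of a service.
HEALTHY = {"active", "backup"}


def unhealthy_services(status_text):
    """Return (section, service, state, detail) for each non-healthy service."""
    section = None
    problems = []
    for raw in status_text.splitlines():
        line = raw.strip()
        match = SECTION_RE.match(line)
        if match:
            section = match.group("section")
            continue
        match = SERVICE_RE.match(line)
        if match and match.group("state") not in HEALTHY:
            problems.append(
                (section, match.group("name"), match.group("state"),
                 match.group("detail") or "")
            )
    return problems


sample = """\
== Contrail control ==
control: initializing (No BGP configuration for self)
nodemgr: initializing
named: active
dns: active

== Contrail config ==
api: active
svc-monitor: backup
device-manager: initializing (ApiServer:ApiServer[] connection down)
"""
for row in unhealthy_services(sample):
    print(row)
```

Run against the dumps above, this would show that `control` and every `nodemgr` are stuck in `initializing` on all three servers, while the config `api` itself reports `active`.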

LOGS:

Contrail-collector.log
Server3:
```
4ce9-628a-4850-b069-a7c39f05816d COLUMN FAILED
2018-04-23 Mon 04:16:49:692.246 UTC server3 [Thread 140493048489728, Pid 1]: server3:Global: ObjectTableInsert: Addition of server1:10.10.0.8, message UUID 01e9e81e-77bc-4082-8098-ed60b25fe961 ObjectXmppPeerInfo into table ObjectValueTable FAILED
2018-04-23 Mon 04:16:49:692.317 UTC server3 [Thread 140493048489728, Pid 1]: server3:Global: MessageTableOnlyInsert: Addition of message: XMPPPeerInfo, message UUID: 01e9e81e-77bc-4082-8098-ed60b25fe961 COLUMN FAILED
2018-04-23 Mon 04:16:49:692.526 UTC server3 [Thread 140493048489728, Pid 1]: server3:Global: ObjectTableInsert: Addition of server1:10.10.0.8, message UUID 43ce2323-f59e-49de-9b8f-874673b040fa ObjectXmppPeerInfo into table ObjectValueTable FAILED
2018-04-23 Mon 04:16:49:692.584 UTC server3 [Thread 140493048489728, Pid 1]: server3:Global: MessageTableOnlyInsert: Addition of message: PeerStatsUve, message UUID: 43ce2323-f59e-49de-9b8f-874673b040fa COLUMN FAILED
2018-04-23 Mon 04:16:49:693.070 UTC server3 [Thread 140493048489728, Pid 1]: server3:Global: ObjectTableInsert: Addition of server1:10.10.0.8, message UUID 51e54dbe-2717-49dd-ac64-612a45a525af ObjectXmppPeerInfo into table ObjectValueTable FAILED
2018-04-23 Mon 04:16:49:693.152 UTC server3 [Thread 140493048489728, Pid 1]: server3:Global: MessageTableOnlyInsert: Addition of message: XMPPPeerInfo, message UUID: 51e54dbe-2717-49dd-ac64-612a45a525af COLUMN FAILED
```

Server1 and Server2:
```
2018-04-22 Sun 19:09:11:693.468 UTC server1 [Thread 140326290929856, Pid 1]: Kafka new topic -uve-25 Err
2018-04-22 Sun 19:09:11:693.484 UTC server1 [Thread 140326290929856, Pid 1]: Kafka new topic -uve-26 Err
2018-04-22 Sun 19:09:11:693.502 UTC server1 [Thread 140326290929856, Pid 1]: Kafka new topic -uve-27 Err
2018-04-22 Sun 19:09:11:693.518 UTC server1 [Thread 140326290929856, Pid 1]: Kafka new topic -uve-28 Err
2018-04-22 Sun 19:09:11:693.547 UTC server1 [Thread 140326290929856, Pid 1]: Kafka new topic -uve-29 Err
2018-04-22 Sun 19:09:11:693.564 UTC server1 [Thread 140326290929856, Pid 1]: Kafka new topic -uve-agg-UveVirtualNetworkAgent-egress_flow_count Err
2018-04-22 Sun 19:09:11:693.586 UTC server1 [Thread 140326290929856, Pid 1]: Kafka new topic -uve-agg-UveVirtualNetworkAgent-ingress_flow_count Err
2018-04-22 Sun 19:09:11:694.827 UTC server1 [Thread 140326290929856, Pid 1]: Kafka consumer Err
2018-04-22 Sun 19:09:11:695.324 UTC server1 [Thread 140326290929856, Pid 1]: [SYS_NOTICE]: NodeStatusUVE: data= [ name = server1 process_status= [ [ [ module_id = contrail-collector instance_id = 0 state = Non-Functional connection_infos= [ [ [ type = Redis-UVE name = To server_addrs= [ [ (*_iter6) = 127.0.0.1:6379, ] ] status = Initializing description = ], [ type = KafkaPub name = 10.10.0.4:9092,10.10.0.5:9092,10.10.0.6:9092 server_addrs= [ [ (*_iter6) = 0.0.0.0:0, ] ] status = Initializing description = ], ] ] description = Number of connections:2, Expected:7 Missing: Collector,Redis-UVE:From,Database:server1:Global,Database:Cassandra,Database:RabbitMQ ], ] ] ]
2018-04-22 Sun 19:09:11:696.039 UTC server1 [Thread 140326290929856, Pid 1]: [SYS_NOTICE]: NodeStatusUVE: data= [ name = server1 process_status= [ [ [ module_id = contrail-collector instance_id = 0 state = Non-Functional connection_infos= [ [ [ type = Redis-UVE name = From server_addrs= [ [ (*_iter6) = 127.0.0.1:6379, ] ] status = Initializing description = ], [ type = Redis-UVE name = To server_addrs= [ [ (*_iter6) = 127.0.0.1:6379, ] ] status = Initializing description = ], [ type = KafkaPub name = 10.10.0.4:9092,10.10.0.5:9092,10.10.0.6:9092 server_addrs= [ [ (*_iter6) = 0.0.0.0:0, ] ] status = Initializing description = ], ] ] description = Number of connections:3, Expected:7 Missing: Collector,Database:server1:Global,Database:Cassandra,Database:RabbitMQ ], ] ] ]

2018-04-23 Mon 01:29:17:565.101 UTC server1 [Thread 140325809010432, Pid 1]: [SYS_NOTICE]: KafkaAggStatusTrace: data= [ name = server1 assign_offsets= [ [ _iter40->first = -uve-agg-UveVirtualNetworkAgent-egress_flow_count [ topic_offsets= [ [ [ partition = 0 offset = 18446744073709550615 ], [ partition = 1 offset = 18446744073709550615 ], [ partition = 2 offset = 18446744073709550615 ], [ partition = 3 offset = 18446744073709550615 ], [ partition = 4 offset = 18446744073709550615 ], [ partition = 5 offset = 18446744073709550615 ], [ partition = 6 offset = 18446744073709550615 ], [ partition = 7 offset = 18446744073709550615 ], [ partition = 8 offset = 18446744073709550615 ], [ partition = 9 offset = 18446744073709550615 ], [ partition = 10 offset = 18446744073709550615 ], [ partition = 11 offset = 18446744073709550615 ], [ partition = 12 offset = 18446744073709550615 ], [ partition = 13 offset = 18446744073709550615 ], [ partition = 14 offset = 18446744073709550615 ], ] ] ], _iter40->first = -uve-agg-UveVirtualNetworkAgent-ingress_flow_count [ topic_offsets= [ [ [ partition = 0 offset = 18446744073709550615 ], [ partition = 1 offset = 18446744073709550615 ], [ partition = 2 offset = 18446744073709550615 ], [ partition = 3 offset = 18446744073709550615 ], [ partition = 4 offset = 18446744073709550615 ], [ partition = 5 offset = 18446744073709550615 ], [ partition = 6 offset = 18446744073709550615 ], [ partition = 7 offset = 18446744073709550615 ], [ partition = 8 offset = 18446744073709550615 ], [ partition = 9 offset = 18446744073709550615 ], [ partition = 10 offset = 18446744073709550615 ], [ partition = 11 offset = 18446744073709550615 ], [ partition = 12 offset = 18446744073709550615 ], [ partition = 13 offset = 18446744073709550615 ], [ partition = 14 offset = 18446744073709550615 ], ] ] ], ] ] ]
2018-04-23 Mon 01:29:29:428.414 UTC server1 [Thread 140325809010432, Pid 1]: UnAssign -uve-agg-UveVirtualNetworkAgent-egress_flow_count : [ 0:18446744073709550616, 1:18446744073709550616, 2:18446744073709550616, 3:18446744073709550616, 4:18446744073709550616, 5:18446744073709550616, 6:18446744073709550616, 7:18446744073709550616, 8:18446744073709550616, 9:18446744073709550616, 10:18446744073709550616, 11:18446744073709550616, 12:18446744073709550616, 13:18446744073709550616, 14:18446744073709550616, ]
-uve-agg-UveVirtualNetworkAgent-ingress_flow_count : [ 0:18446744073709550616, 1:18446744073709550616, 2:18446744073709550616, 3:18446744073709550616, 4:18446744073709550616, 5:18446744073709550616, 6:18446744073709550616, 7:18446744073709550616, 8:18446744073709550616, 9:18446744073709550616, 10:18446744073709550616, 11:18446744073709550616, 12:18446744073709550616, 13:18446744073709550616, 14:18446744073709550616, ]

2018-04-23 Mon 01:29:30:227.865 UTC server1 [Thread 140325809010432, Pid 1]: Assign -uve-agg-UveVirtualNetworkAgent-egress_flow_count : [ 0:18446744073709550615, 1:18446744073709550615, 2:18446744073709550615, 3:18446744073709550615, 4:18446744073709550615, 5:18446744073709550615, 6:18446744073709550615, 7:18446744073709550615, 8:18446744073709550615, 9:18446744073709550615, ]
-uve-agg-UveVirtualNetworkAgent-ingress_flow_count : [ 0:18446744073709550615, 1:18446744073709550615, 2:18446744073709550615, 3:18446744073709550615, 4:18446744073709550615, 5:18446744073709550615, 6:18446744073709550615, 7:18446744073709550615, 8:18446744073709550615, 9:18446744073709550615, ]

2018-04-23 Mon 01:29:30:228.179 UTC server1 [Thread 140325809010432, Pid 1]: [SYS_NOTICE]: KafkaAggStatusTrace: data= [ name = server1 assign_offsets= [ [ _iter40->first = -uve-agg-UveVirtualNetworkAgent-egress_flow_count [ topic_offsets= [ [ [ partition = 0 offset = 18446744073709550615 ], [ partition = 1 offset = 18446744073709550615 ], [ partition = 2 offset = 18446744073709550615 ], [ partition = 3 offset = 18446744073709550615 ], [ partition = 4 offset = 18446744073709550615 ], [ partition = 5 offset = 18446744073709550615 ], [ partition = 6 offset = 18446744073709550615 ], [ partition = 7 offset = 18446744073709550615 ], [ partition = 8 offset = 18446744073709550615 ], [ partition = 9 offset = 18446744073709550615 ], ] ] ], _iter40->first = -uve-agg-UveVirtualNetworkAgent-ingress_flow_count [ topic_offsets= [ [ [ partition = 0 offset = 18446744073709550615 ], [ partition = 1 offset = 18446744073709550615 ], [ partition = 2 offset = 18446744073709550615 ], [ partition = 3 offset = 18446744073709550615 ], [ partition = 4 offset = 18446744073709550615 ], [ partition = 5 offset = 18446744073709550615 ], [ partition = 6 offset = 18446744073709550615 ], [ partition = 7 offset = 18446744073709550615 ], [ partition = 8 offset = 18446744073709550615 ], [ partition = 9 offset = 18446744073709550615 ], ] ] ], ] ] ]
2018-04-23 Mon 03:25:13:839.842 UTC server1 [Thread 140325809010432, Pid 1]: Local: Message timed out : 10.10.0.6:9092/3: 1 request(s) timed out: disconnect (average rtt 102.102ms)

```
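
A note on the large offsets in the log above: 18446744073709550615 and 18446744073709550616 are -1001 and -1000 printed as unsigned 64-bit integers. Assuming the collector's Kafka client is librdkafka (the "Local: Message timed out" line matches its error strings), these correspond to the logical offsets `RD_KAFKA_OFFSET_INVALID` and `RD_KAFKA_OFFSET_STORED` from `rdkafka.h`, not to real partition positions. A small sketch of the conversion, with hypothetical helper names:

```python
def as_signed_64(value):
    """Reinterpret an unsigned 64-bit integer as a two's-complement signed one."""
    if not 0 <= value < 2**64:
        raise ValueError("value does not fit in 64 bits")
    return value - 2**64 if value >= 2**63 else value


# Logical offsets defined in librdkafka's rdkafka.h.
LOGICAL_OFFSETS = {
    -2: "RD_KAFKA_OFFSET_BEGINNING",
    -1: "RD_KAFKA_OFFSET_END",
    -1000: "RD_KAFKA_OFFSET_STORED",
    -1001: "RD_KAFKA_OFFSET_INVALID",
}


def describe_offset(value):
    """Return the signed value and its logical name, if it has one."""
    signed = as_signed_64(value)
    return signed, LOGICAL_OFFSETS.get(signed, "real offset")


print(describe_offset(18446744073709550615))  # Assign / KafkaAggStatusTrace
print(describe_offset(18446744073709550616))  # UnAssign
```

Read this way, the Assign and KafkaAggStatusTrace entries carry no valid committed offset for any partition, which fits the "Kafka new topic ... Err" and "Kafka consumer Err" lines logged earlier on the same server.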

Analytics-api.log
Server3:
```
 for OpServer info None
04/22/2018 07:13:32 PM [contrail-analytics-api]: Analytics Discovery OpServer reconnect
04/22/2018 07:13:32 PM [contrail-analytics-api]: Analytics Discovery cannot publish while down
04/22/2018 07:13:32 PM [contrail-analytics-api]: Exception SessionExpiredError in AnalyticsDiscovery reconnect. Args:
() : traceback Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/opserver/opserver_util.py", line 255, in _run
    self._zk.ensure_path(self._basepath + "/" + wk)
  File "/usr/lib/python2.7/site-packages/kazoo/client.py", line 923, in ensure_path
    return self.ensure_path_async(path, acl).get()
  File "/usr/lib/python2.7/site-packages/kazoo/handlers/utils.py", line 78, in get
    raise self._exception
SessionExpiredError
 for OpServer info None
04/22/2018 07:13:32 PM [contrail-analytics-api]: Analytics Discovery OpServer reconnect
04/22/2018 07:13:32 PM [contrail-analytics-api]: Analytics Discovery cannot publish while down
04/22/2018 07:13:32 PM [contrail-analytics-api]: Analytics Discovery listen CONNECTED
04/22/2018 07:13:32 PM [contrail-analytics-api]: Analytics Discovery to publish None

04/23/2018 04:04:36 AM [contrail-analytics-api]: before res list is bgp-peer
04/23/2018 04:04:36 AM [contrail-analytics-api]: res list is None
04/23/2018 04:04:37 AM [contrail-analytics-api]: before res list is bgp-peer
04/23/2018 04:04:37 AM [contrail-analytics-api]: res list is None
04/23/2018 04:04:58 AM [contrail-analytics-api]: before res list is vrouter
04/23/2018 04:04:58 AM [contrail-analytics-api]: res list is None
04/23/2018 04:08:28 AM [contrail-analytics-api]: before res list is generator
04/23/2018 04:08:29 AM [contrail-analytics-api]: res list is None
04/23/2018 04:08:50 AM [contrail-analytics-api]: before res list is database-node
04/23/2018 04:08:50 AM [contrail-analytics-api]: res list is None
04/23/2018 04:08:50 AM [contrail-analytics-api]: before res list is database-node
04/23/2018 04:08:50 AM [contrail-analytics-api]: res list is None
04/23/2018 04:09:36 AM [contrail-analytics-api]: before res list is bgp-peer
04/23/2018 04:09:36 AM [contrail-analytics-api]: res list is None
04/23/2018 04:09:37 AM [contrail-analytics-api]: before res list is bgp-peer
04/23/2018 04:09:37 AM [contrail-analytics-api]: res list is None
04/23/2018 04:09:59 AM [contrail-analytics-api]: before res list is vrouter
04/23/2018 04:09:59 AM [contrail-analytics-api]: res list is None
04/23/2018 04:10:30 AM [contrail-analytics-api]: before res list is prouter
04/23/2018 04:10:30 AM [contrail-analytics-api]: res list is None
04/23/2018 04:10:30 AM [contrail-analytics-api]: before res list is vrouter
04/23/2018 04:10:30 AM [contrail-analytics-api]: res list is None
04/23/2018 04:12:29 AM [contrail-analytics-api]: before res list is generator
04/23/2018 04:12:29 AM [contrail-analytics-api]: res list is None
04/23/2018 04:14:36 AM [contrail-analytics-api]: before res list is bgp-peer
04/23/2018 04:14:36 AM [contrail-analytics-api]: res list is None
04/23/2018 04:14:37 AM [contrail-analytics-api]: before res list is bgp-peer
04/23/2018 04:14:37 AM [contrail-analytics-api]: res list is None
04/23/2018 04:14:59 AM [contrail-analytics-api]: before res list is vrouter
04/23/2018 04:14:59 AM [contrail-analytics-api]: res list is None
04/23/2018 04:16:29 AM [contrail-analytics-api]: before res list is generator
04/23/2018 04:16:29 AM [contrail-analytics-api]: res list is None
04/23/2018 04:18:50 AM [contrail-analytics-api]: before res list is database-node
04/23/2018 04:18:50 AM [contrail-analytics-api]: res list is None
04/23/2018 04:18:50 AM [contrail-analytics-api]: before res list is database-node
04/23/2018 04:18:50 AM [contrail-analytics-api]: res list is None

```

Status at agent:
```
vrouter kernel module is PRESENT
== Contrail vrouter ==
nodemgr: initializing
agent: initializing (XMPP:control-node:10.10.0.5, XMPP:control-node:10.10.0.6 connection down, No Configuration for self)

```

Pulkit Tandon (pulkitt) wrote :

This issue has not recurred. It was seen only once, when the bug was reported.
It can be kept under observation for a few days.

Pulkit Tandon (pulkitt) wrote :

The issue has not been observed for a long time across many builds, on either the master branch or R5.0.
Closing the bug.
