Comment 8 for bug 1764493

Pulkit Tandon (pulkitt) wrote : Re: Debugging required on k8s sanity setup which failed for R5.0-16

Hi Michael,

I did not explicitly assign any resources to the instances.

Following is my instances.yaml:

global_configuration:
  REGISTRY_PRIVATE_INSECURE: True
  CONTAINER_REGISTRY: 10.204.217.152:5000
provider_config:
  bms:
    domainsuffix: englab.juniper.net
    ntpserver: 10.204.217.158
    ssh_pwd: c0ntrail123
    ssh_user: root

instances:
  nodec58:
      ip: 10.204.217.98
      provider: bms
      roles:
          analytics: null
          analytics_database: null
          config: null
          config_database: null
          control: null
          k8s_master: null
          kubemanager: null
          webui: null
  nodec60:
      ip: 10.204.217.100
      provider: bms
      roles:
          k8s_node: null
          vrouter:
              PHYSICAL_INTERFACE: bond0
  nodec61:
      ip: 10.204.217.101
      provider: bms
      roles:
          k8s_node: null
          vrouter:
              PHYSICAL_INTERFACE: enp2s0f1
  nodeg12:
      ip: 10.204.217.52
      provider: bms
      roles:
          analytics: null
          analytics_database: null
          config: null
          config_database: null
          control: null
          k8s_master: null
          kubemanager: null
          webui: null
  nodeg31:
      ip: 10.204.217.71
      provider: bms
      roles:
          analytics: null
          analytics_database: null
          config: null
          config_database: null
          control: null
          k8s_master: null
          kubemanager: null
          webui: null

contrail_configuration:
  CONTRAIL_VERSION: ocata-5.0-15
  CLOUD_ORCHESTRATOR: kubernetes
  METADATA_PROXY_SECRET: c0ntrail123
  CONTAINER_REGISTRY: 10.204.217.152:5000
  CONTROL_DATA_NET_LIST: 77.77.1.0/24
  CONTROLLER_NODES: 77.77.1.20,77.77.1.30,77.77.1.11
  KUBERNETES_PUBLIC_FIP_POOL:
      domain: default-domain
      name: __fip_pool_public__
      network: __public__
      project: default
  RABBITMQ_NODE_PORT: 5673
  REGISTRY_PRIVATE_INSECURE: true
  VROUTER_GATEWAY: 77.77.1.100
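
Since nothing is set per instance, the containers presumably just use whatever the BMS hosts provide. For reference, a minimal sketch (assuming PyYAML is installed and the file above is saved locally as instances.yaml) that summarizes which roles land on which host, to double-check the intended 3-controller / 2-worker layout:

#!/usr/bin/env python
# Sketch: summarize which roles land on which host, per the instances.yaml above.
# Assumes PyYAML is installed and the file is saved locally as instances.yaml.
import yaml

with open("instances.yaml") as f:
    cfg = yaml.safe_load(f)

for name, inst in sorted(cfg.get("instances", {}).items()):
    roles = sorted((inst.get("roles") or {}).keys())
    print("%-10s %-16s %s" % (name, inst.get("ip", "-"), ", ".join(roles)))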

From: Andrey Pavlov <email address hidden>
Date: Tuesday, April 17, 2018 at 6:34 PM
To: Michael Henkel <email address hidden>
Cc: Pulkit Tandon <email address hidden>, Sachchidanand Vaidya <email address hidden>, Dinesh Bakiaraj <email address hidden>, Prasanna Mucharikar <email address hidden>, Yuvaraja Mariappan <email address hidden>, Aniket Gawade <email address hidden>, Sudheendra Rao <email address hidden>, Venkatesh Velpula <email address hidden>, Rudra Rugge <email address hidden>, Ignatious Johnson <email address hidden>
Subject: Re: Debugging required on k8s sanity setup which failed for R5.0-16

Hey Michael,

I have similar problems in my 3-node setup:

== Contrail control ==
control: active
nodemgr: active
named: active
dns: active

== Contrail analytics ==
snmp-collector: initializing (Database:Cassandra[] connection down)
query-engine: active
api: active
alarm-gen: initializing (Database:Cassandra[] connection down)
nodemgr: active
collector: initializing (Database:Cassandra connection down)
topology: initializing (Database:Cassandra[] connection down)

== Contrail config ==
api: initializing (Database:Cassandra[] connection down)
zookeeper: active
svc-monitor: backup
nodemgr: active
device-manager: backup
cassandra: active
rabbitmq: active
schema: backup

== Contrail webui ==
web: active
job: active

== Contrail database ==
kafka: active
nodemgr: active
zookeeper: active
cassandra: active

[root@node-10-1-56-124 ~]# free -hw
              total        used        free      shared     buffers       cache   available
Mem:            15G         11G        3.3G         28M          0B        892M        3.7G
Swap:            0B          0B          0B
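
Given Michael's question about resources, and since this node already shows 11G of 15G used, it might be worth collecting available memory on all five nodes in one go. A minimal sketch, assuming passwordless root ssh (ssh_user is root in the instances.yaml above) and using the management IPs from this thread:

#!/usr/bin/env python
# Sketch: report MemTotal/MemAvailable on each node over ssh.
# Assumes passwordless ssh as root to the management IPs listed in this thread.
import subprocess

NODES = ["10.204.217.52", "10.204.217.71", "10.204.217.98",
         "10.204.217.100", "10.204.217.101"]

for node in NODES:
    cmd = ["ssh", "root@" + node,
           "grep -E 'MemTotal|MemAvailable' /proc/meminfo"]
    try:
        out = subprocess.check_output(cmd).decode().replace("\n", "  ")
        print("%-16s %s" % (node, out.strip()))
    except subprocess.CalledProcessError as err:
        print("%-16s ssh/grep failed: %s" % (node, err))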

Regards,
Andrey Pavlov.

On Tue, Apr 17, 2018 at 3:57 PM, Michael Henkel <email address hidden> wrote:
Pulkit,

How many resources did you assign to your instances?
Regards,
Michael

On 17.04.2018 at 05:37, Pulkit Tandon <email address hidden> wrote:
Hi All,

I need your help and expertise debugging the k8s sanity setup, which is in a really bad state. Things have become messier starting with build 15.
I observed multiple problems on the current attempt; I am not sure whether they are linked or all different.
I have left the setup in its current state so that you can debug the failures live.

K8s HA setup details:
3 controllers + kube-managers:
10.204.217.52 (nodeg12)
10.204.217.71 (nodeg31)
10.204.217.98 (nodec58)
2 agents / k8s nodes:
10.204.217.100 (nodec60)
10.204.217.101 (nodec61)
Multi-interface setup

Following are the key observations:

1. The RabbitMQ cluster formed only between nodeg12 and nodeg31; nodec58 reports rabbitmq as inactive.

rabbitmq: inactive

Docker logs for rabbitmq container on nodec58:

{"init terminating in do_boot",{error,{inconsistent_cluster,"Node contrail@nodec58 thinks it's clustered with node contrail@nodeg31, but contrail@nodeg31 disagrees"}}}

2. On all 3 controllers, the Cassandra connection was not established for 2 hours after provisioning. The issue flaps over time, and sometimes I see the services as active too:
control: initializing (Database:Cassandra connection down)
collector: initializing (Database:Cassandra connection down)
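
Since this is a multi-interface setup, one thing worth ruling out is plain reachability of Cassandra over the control/data network, which is what the CONTROLLER_NODES list points at. A minimal sketch that tries a TCP connect from a controller to each of those addresses; the ports are assumptions (9041 as the config-database CQL port, 9042 as the stock Cassandra default), so check the real listen ports with "ss -ltn" first:

#!/usr/bin/env python
# Sketch: TCP reachability check toward Cassandra on the control/data network.
# The CONTROLLER_NODES addresses come from the instances.yaml above; the ports
# are assumptions -- verify the real listen ports with "ss -ltn" on a controller.
import socket

CONTROLLER_NODES = ["77.77.1.20", "77.77.1.30", "77.77.1.11"]
PORTS = [9041, 9042]   # assumed CQL ports; adjust as needed

for host in CONTROLLER_NODES:
    for port in PORTS:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.settimeout(3)
        try:
            s.connect((host, port))
            print("%s:%d reachable" % (host, port))
        except socket.error as err:
            print("%s:%d NOT reachable (%s)" % (host, port, err))
        finally:
            s.close()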

3. If I create a k8s pod, it often results in pod creation failure, and the vrouter agent crashes immediately.
The trace is below.
Whether or not the crash happens, pod creation fails.
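
To reproduce this in a controlled way, a small sketch that creates a throwaway busybox pod from a k8s master and polls its phase, so a failure (and any agent crash) can be matched against a timestamp; the pod name cni-repro is arbitrary:

#!/usr/bin/env python
# Sketch: create a throwaway pod and poll its phase to reproduce the failure.
# Run on a k8s master node; "cni-repro" is an arbitrary pod name.
import subprocess, time

NAME = "cni-repro"
subprocess.call(["kubectl", "run", NAME, "--image=busybox",
                 "--restart=Never", "--command", "--", "sleep", "3600"])

for i in range(30):
    phase = subprocess.check_output(
        ["kubectl", "get", "pod", NAME, "-o", "jsonpath={.status.phase}"]).decode()
    print("attempt %d: phase %s" % (i, phase))
    if phase == "Running":
        break
    time.sleep(10)
else:
    # Dump scheduling/CNI events if the pod never came up.
    subprocess.call(["kubectl", "describe", "pod", NAME])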

4. On the CNI of both agents, I am seeing these errors:
I : 24646 : 2018/04/17 17:35:44 vrouter.go:79: VRouter request. Operation : GET Url : http://127.0.0.1:9091/vm/7a271412-4237-11e8-8997-002590c55f6a

E : 24646 : 2018/04/17 17:35:44 vrouter.go:147: Failed HTTP Get operation. Return code 404

I : 24646 : 2018/04/17 17:35:44 vrouter.go:181: Iteration 14 : Get vrouter failed

E : 24633 : 2018/04/17 17:35:49 vrouter.go:287: Error in polling VRouter

I : 24633 : 2018/04/17 17:35:49 cni.go:175: Error in Add to VRouter

E : 24633 : 2018/04/17 17:35:49 contrail-kube-cni.go:67: Failed processing Add command.

E : 24646 : 2018/04/17 17:35:49 vrouter.go:287: Error in polling VRouter

I : 24646 : 2018/04/17 17:35:49 cni.go:175: Error in Add to VRouter

E : 24646 : 2018/04/17 17:35:49 contrail-kube-cni.go:67: Failed processing Add command.
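
The 404 means the agent answers on port 9091 but has no record of that pod's interface yet, which is why the CNI gives up after its retries. A small sketch to poll the same endpoint by hand from an agent node; the UUID below is the one from the log above, so substitute the UUID of the pod being debugged:

#!/usr/bin/env python
# Sketch: poll the vrouter agent's CNI port for a given VM/pod UUID, the same
# GET the CNI plugin issues. Replace UUID with the pod UUID you are debugging.
import time
try:
    from urllib.request import urlopen
    from urllib.error import HTTPError, URLError
except ImportError:                      # python 2 fallback
    from urllib2 import urlopen, HTTPError, URLError

UUID = "7a271412-4237-11e8-8997-002590c55f6a"   # UUID from the log above
URL = "http://127.0.0.1:9091/vm/" + UUID

for i in range(20):
    try:
        body = urlopen(URL, timeout=3).read()
        print("attempt %d: found, %d bytes" % (i, len(body)))
        break
    except HTTPError as err:
        print("attempt %d: HTTP %d" % (i, err.code))
    except URLError as err:
        print("attempt %d: connection error (%s)" % (i, err.reason))
    time.sleep(5)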

NOTE: Most of the issues observed are on the k8s HA multi-interface setup.
      Things are better with a non-HA / single-interface setup.

Agent crash trace:
(gdb) bt full
#0 0x00007fb9817761f7 in raise () from /lib64/libc.so.6
No symbol table info available.
#1 0x00007fb9817778e8 in abort () from /lib64/libc.so.6
No symbol table info available.
#2 0x00007fb98176f266 in __assert_fail_base () from /lib64/libc.so.6
No symbol table info available.
#3 0x00007fb98176f312 in __assert_fail () from /lib64/libc.so.6
No symbol table info available.
#4 0x0000000000c15440 in AgentOperDBTable::ConfigEventHandler(IFMapNode*, DBEntry*) ()
No symbol table info available.
#5 0x0000000000c41714 in IFMapDependencyManager::ProcessChangeList() ()
No symbol table info available.
#6 0x0000000000ea4a57 in TaskTrigger::WorkerTask::Run() ()
No symbol table info available.
#7 0x0000000000e9e64f in TaskImpl::execute() ()
No symbol table info available.
#8 0x00007fb9823458ca in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all(tbb::task&, tbb::task*) () from /lib64/libtbb.so.2
No symbol table info available.
#9 0x00007fb9823415b6 in tbb::internal::arena::process(tbb::internal::generic_scheduler&) () from /lib64/libtbb.so.2
No symbol table info available.
#10 0x00007fb982340c8b in tbb::internal::market::process(rml::job&) () from /lib64/libtbb.so.2
No symbol table info available.
#11 0x00007fb98233e67f in tbb::internal::rml::private_worker::run() () from /lib64/libtbb.so.2
No symbol table info available.
#12 0x00007fb98233e879 in tbb::internal::rml::private_worker::thread_routine(void*) () from /lib64/libtbb.so.2
No symbol table info available.
#13 0x00007fb982560e25 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#14 0x00007fb98183934d in clone () from /lib64/libc.so.6
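
Since every frame reports "No symbol table info available", it may help to regenerate the trace in batch mode with all threads included before attaching it to the bug. A sketch, where both the binary path and the core file location are assumptions to adjust for the crashing compute node:

#!/usr/bin/env python
# Sketch: re-extract the agent backtrace with gdb in batch mode, all threads.
# BINARY and CORE are assumptions -- point them at the actual agent binary and
# core file on the node where the crash happened.
import subprocess

BINARY = "/usr/bin/contrail-vrouter-agent"               # assumed install path
CORE = "/var/crashes/core.contrail-vrouter-agent.1234"   # assumed core file

subprocess.call(["gdb", "--batch",
                 "-ex", "set pagination off",
                 "-ex", "thread apply all bt full",
                 BINARY, CORE])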

Thanks!
Pulkit Tandon