== Contrail control ==
control: active
nodemgr: active
named: active
dns: active
== Contrail analytics ==
snmp-collector: initializing (Database:Cassandra[] connection down)
query-engine: active
api: active
alarm-gen: initializing (Database:Cassandra[] connection down)
nodemgr: active
collector: initializing (Database:Cassandra connection down)
topology: initializing (Database:Cassandra[] connection down)
== Contrail config ==
api: initializing (Database:Cassandra[] connection down)
zookeeper: active
svc-monitor: backup
nodemgr: active
device-manager: backup
cassandra: active
rabbitmq: active
schema: backup
== Contrail webui ==
web: active
job: active
== Contrail database ==
kafka: active
nodemgr: active
zookeeper: active
cassandra: active
[root@node-10-1-56-124 ~]# free -hw
total used free shared buffers cache available
Mem: 15G 11G 3.3G 28M 0B 892M 3.7G
Swap: 0B 0B 0B
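A small sketch for triaging contrail-status dumps like the one above programmatically: collect every service not reporting `active`. It assumes the `name: state (detail)` line format shown here and is illustrative only, not a Contrail tool.

```python
def non_active(status_output: str):
    """Collect services whose contrail-status state is not 'active'.
    Assumes the 'name: state (detail)' line format; illustrative only."""
    bad = {}
    for line in status_output.splitlines():
        line = line.strip()
        if line.startswith("==") or ": " not in line:
            continue  # skip section headers and blank/odd lines
        name, _, rest = line.partition(": ")
        state = rest.split(" ", 1)[0]
        if state != "active":
            bad[name] = state
    return bad

sample = """\
== Contrail analytics ==
snmp-collector: initializing (Database:Cassandra[] connection down)
query-engine: active
api: active
"""
print(non_active(sample))  # → {'snmp-collector': 'initializing'}
```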
Regards,
Andrey Pavlov.
On Tue, Apr 17, 2018 at 3:57 PM, Michael Henkel <email address hidden> wrote:
Pulkit,
How many resources did you assign to your instances?
Regards,
Michael
On 17.04.2018 at 05:37, Pulkit Tandon <email address hidden> wrote:
Hi All,
I need your help and expertise debugging the k8s sanity setup, which is in a really bad state. Things got messier starting with build 15.
I observed multiple problems on the current attempt; I am not sure whether they are linked or all separate.
I have left the setup untouched so that you can debug the failures live.
K8s HA Setup details:
3 Controller+kube managers:
10.204.217.52(nodeg12)
10.204.217.71(nodeg31)
10.204.217.98(nodec58)
2 Agents / k8s slaves:
10.204.217.100(nodec60)
10.204.217.101(nodec61)
Multi interface setup
Following are key observations:
1. The RabbitMQ cluster formed only between nodeg12 and nodeg31; on nodec58, rabbitmq is inactive.
rabbitmq: inactive
Docker logs for rabbitmq container on nodec58:
{"init terminating in do_boot",{error,{inconsistent_cluster,"Node contrail@nodec58 thinks it's clustered with node contrail@nodeg31, but contrail@nodeg31 disagrees"}}}
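The crash message is RabbitMQ/Mnesia refusing to boot: nodec58's on-disk cluster state names nodeg31 as a peer, while nodeg31 no longer agrees (the usual remedy is resetting the stale member and rejoining it). A hypothetical triage helper (not part of Contrail or RabbitMQ) that pulls the two disagreeing node names out of such a line:

```python
import re

def parse_inconsistent_cluster(log_line):
    """Extract (claiming_node, disagreeing_node) from a RabbitMQ
    inconsistent_cluster crash message. Hypothetical helper for triage."""
    m = re.search(
        r"Node (\S+) thinks it's clustered with node (\S+), but \S+ disagrees",
        log_line,
    )
    return (m.group(1), m.group(2)) if m else None

line = ('{"init terminating in do_boot",{error,{inconsistent_cluster,'
        '"Node contrail@nodec58 thinks it\'s clustered with node '
        'contrail@nodeg31, but contrail@nodeg31 disagrees"}}}')
print(parse_inconsistent_cluster(line))
# → ('contrail@nodec58', 'contrail@nodeg31')
```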
2. On all 3 controllers, the Cassandra connection was not established for 2 hours after provisioning. The issue flaps over time, and sometimes I see the services as active too:
control: initializing (Database:Cassandra connection down)
collector: initializing (Database:Cassandra connection down)
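The flapping suggests the services race Cassandra's startup and keep retrying the connection; conceptually this is a retry loop with backoff. A generic sketch of that pattern (plain Python, not Contrail's actual nodemgr/connection logic):

```python
import time

def wait_until_up(probe, attempts=8, delay=0.5, backoff=2.0, sleep=time.sleep):
    """Retry probe() until it returns True; return the attempt count
    that succeeded, or raise TimeoutError. Generic sketch only."""
    for i in range(1, attempts + 1):
        if probe():
            return i
        if i < attempts:
            sleep(delay)        # wait before the next probe
            delay *= backoff    # exponential backoff

    raise TimeoutError("service did not come up")

# Simulate a database that only answers on the third probe.
state = {"calls": 0}
def fake_probe():
    state["calls"] += 1
    return state["calls"] >= 3

print(wait_until_up(fake_probe, sleep=lambda _: None))  # → 3
```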
3. If I create a k8s pod, it often results in pod creation failure, and a vrouter crash follows instantly.
The trace is below.
Pod creation fails regardless of whether the crash happens.
NOTE: Most of the issues are observed on the k8s HA multi-interface setup.
Things are better with a non-HA / single-interface setup.
Agent crash trace:
(gdb) bt full
#0 0x00007fb9817761f7 in raise () from /lib64/libc.so.6
No symbol table info available.
#1 0x00007fb9817778e8 in abort () from /lib64/libc.so.6
No symbol table info available.
#2 0x00007fb98176f266 in __assert_fail_base () from /lib64/libc.so.6
No symbol table info available.
#3 0x00007fb98176f312 in __assert_fail () from /lib64/libc.so.6
No symbol table info available.
#4 0x0000000000c15440 in AgentOperDBTable::ConfigEventHandler(IFMapNode*, DBEntry*) ()
No symbol table info available.
#5 0x0000000000c41714 in IFMapDependencyManager::ProcessChangeList() ()
No symbol table info available.
#6 0x0000000000ea4a57 in TaskTrigger::WorkerTask::Run() ()
No symbol table info available.
#7 0x0000000000e9e64f in TaskImpl::execute() ()
No symbol table info available.
#8 0x00007fb9823458ca in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all(tbb::task&, tbb::task*) () from /lib64/libtbb.so.2
No symbol table info available.
#9 0x00007fb9823415b6 in tbb::internal::arena::process(tbb::internal::generic_scheduler&) () from /lib64/libtbb.so.2
No symbol table info available.
#10 0x00007fb982340c8b in tbb::internal::market::process(rml::job&) () from /lib64/libtbb.so.2
No symbol table info available.
#11 0x00007fb98233e67f in tbb::internal::rml::private_worker::run() () from /lib64/libtbb.so.2
No symbol table info available.
#12 0x00007fb98233e879 in tbb::internal::rml::private_worker::thread_routine(void*) () from /lib64/libtbb.so.2
No symbol table info available.
#13 0x00007fb982560e25 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#14 0x00007fb98183934d in clone () from /lib64/libc.so.6
Hi Michael,
I did not explicitly assign any resources to the instances.
Following is my instances.yaml:
global_configuration:
  CONTAINER_REGISTRY: 10.204.217.152:5000
  REGISTRY_PRIVATE_INSECURE: True
provider_config:
bms:
domainsuffix: englab.juniper.net
ntpserver: 10.204.217.158
ssh_pwd: c0ntrail123
ssh_user: root
instances:
  nodec58:
    ip: 10.204.217.98
    provider: bms
    roles:
      analytics: null
      analytics_database: null
      config: null
      config_database: null
      control: null
      k8s_master: null
      kubemanager: null
      webui: null
  nodec60:
    ip: 10.204.217.100
    provider: bms
    roles:
      k8s_node: null
      vrouter:
        PHYSICAL_INTERFACE: bond0
  nodec61:
    ip: 10.204.217.101
    provider: bms
    roles:
      k8s_node: null
      vrouter:
        PHYSICAL_INTERFACE: enp2s0f1
  nodeg12:
    ip: 10.204.217.52
    provider: bms
    roles:
      analytics: null
      analytics_database: null
      config: null
      config_database: null
      control: null
      k8s_master: null
      kubemanager: null
      webui: null
  nodeg31:
    ip: 10.204.217.71
    provider: bms
    roles:
      analytics: null
      analytics_database: null
      config: null
      config_database: null
      control: null
      k8s_master: null
      kubemanager: null
      webui: null
contrail_configuration:
  CLOUD_ORCHESTRATOR: kubernetes
  CONTAINER_REGISTRY: 10.204.217.152:5000
  CONTRAIL_VERSION: ocata-5.0-15
  CONTROL_DATA_NET_LIST: 77.77.1.0/24
  CONTROLLER_NODES: 77.77.1.20,77.77.1.30,77.77.1.11
  KUBERNETES_PUBLIC_FIP_POOL:
    domain: default-domain
    name: __fip_pool_public__
    network: __public__
    project: default
  METADATA_PROXY_SECRET: c0ntrail123
  RABBITMQ_NODE_PORT: 5673
  REGISTRY_PRIVATE_INSECURE: true
  VROUTER_GATEWAY: 77.77.1.100
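One quick check this layout invites: the quorum-based services (ZooKeeper, Cassandra, RabbitMQ) ride on the config/control nodes and need at least three members, with an odd count, to tolerate a failure. A sanity-check sketch over the role layout above, with plain dicts standing in for the parsed YAML (illustrative, not a deployer feature):

```python
# Plain dicts standing in for the parsed instances.yaml roles.
roles_by_node = {
    "nodeg12": {"config", "control", "webui"},
    "nodeg31": {"config", "control", "webui"},
    "nodec58": {"config", "control", "webui"},
    "nodec60": {"k8s_node", "vrouter"},
    "nodec61": {"k8s_node", "vrouter"},
}

def quorum_ok(roles_by_node, role="config"):
    """True if the nodes carrying `role` form a viable quorum:
    at least 3 members and an odd count. Illustrative check only."""
    n = sum(1 for roles in roles_by_node.values() if role in roles)
    return n >= 3 and n % 2 == 1

print(quorum_ok(roles_by_node))  # → True
```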
From: Andrey Pavlov <email address hidden>
Date: Tuesday, April 17, 2018 at 6:34 PM
To: Michael Henkel <email address hidden>
Cc: Pulkit Tandon <email address hidden>, Sachchidanand Vaidya <email address hidden>, Dinesh Bakiaraj <email address hidden>, Prasanna Mucharikar <email address hidden>, Yuvaraja Mariappan <email address hidden>, Aniket Gawade <email address hidden>, Sudheendra Rao <email address hidden>, Venkatesh Velpula <email address hidden>, Rudra Rugge <email address hidden>, Ignatious Johnson <email address hidden>
Subject: Re: Debugging required on k8s sanity setup which failed for R5.0-16
Hey Michael,
I have similar problems in my 3-nodes setup:
4. On the CNI of both agents, seeing this error:
I : 24646 : 2018/04/17 17:35:44 vrouter.go:79: VRouter request. Operation : GET Url : http://127.0.0.1:9091/vm/7a271412-4237-11e8-8997-002590c55f6a
E : 24646 : 2018/04/17 17:35:44 vrouter.go:147: Failed HTTP Get operation. Return code 404
I : 24646 : 2018/04/17 17:35:44 vrouter.go:181: Iteration 14 : Get vrouter failed
E : 24633 : 2018/04/17 17:35:49 vrouter.go:287: Error in polling VRouter
I : 24633 : 2018/04/17 17:35:49 cni.go:175: Error in Add to VRouter
E : 24633 : 2018/04/17 17:35:49 contrail-kube-cni.go:67: Failed processing Add command.
E : 24646 : 2018/04/17 17:35:49 vrouter.go:287: Error in polling VRouter
I : 24646 : 2018/04/17 17:35:49 cni.go:175: Error in Add to VRouter
E : 24646 : 2018/04/17 17:35:49 contrail-kube-cni.go:67: Failed processing Add command.
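These logs show the CNI plugin polling the agent's port endpoint (`http://127.0.0.1:9091/vm/<uuid>`) and giving up after repeated 404s, which is why the pod add fails whether or not the agent also crashes. The polling pattern, sketched with an injectable GET function so it stays self-contained (not the actual contrail-kube-cni code):

```python
def poll_vrouter(get, url, max_iterations=15):
    """Poll until the agent answers 200 for the port, mirroring the
    'Iteration N : Get vrouter failed' loop in the log above.
    `get` returns an HTTP status code; sketch only."""
    for _ in range(max_iterations):
        if get(url) == 200:
            return True
    return False

# Simulate an agent that never learns the port: every GET returns 404,
# so the add fails, matching the 'Error in polling VRouter' log line.
ok = poll_vrouter(lambda url: 404,
                  "http://127.0.0.1:9091/vm/7a271412-4237-11e8-8997-002590c55f6a")
print(ok)  # → False
```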
Thanks!
Pulkit Tandon