On 17.04.2018 at 06:04, Andrey Pavlov <email address hidden> wrote:
Hey Michael,
I have similar problems in my 3-node setup:
== Contrail control ==
control: active
nodemgr: active
named: active
dns: active
== Contrail analytics ==
snmp-collector: initializing (Database:Cassandra[] connection down)
query-engine: active
api: active
alarm-gen: initializing (Database:Cassandra[] connection down)
nodemgr: active
collector: initializing (Database:Cassandra connection down)
topology: initializing (Database:Cassandra[] connection down)
== Contrail config ==
api: initializing (Database:Cassandra[] connection down)
zookeeper: active
svc-monitor: backup
nodemgr: active
device-manager: backup
cassandra: active
rabbitmq: active
schema: backup
== Contrail webui ==
web: active
job: active
== Contrail database ==
kafka: active
nodemgr: active
zookeeper: active
cassandra: active
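
A status listing like the one above can be filtered mechanically for services that are not healthy. The following is only a minimal sketch: the function name `unhealthy_services` is hypothetical, it assumes the listing has been saved as text in the `== Section ==` / `service: state (detail)` layout shown here, and it assumes that `backup` is a normal state for the active/backup services.

```python
import re

# Section headers look like "== Contrail analytics ==".
SECTION_RE = re.compile(r"^==\s*(.+?)\s*==$")
# Service lines look like "api: initializing (Database:Cassandra[] connection down)".
SERVICE_RE = re.compile(r"^([\w-]+):\s*(\w+)(?:\s*\((.*)\))?$")


def unhealthy_services(status_text, healthy=("active", "backup")):
    """Return (section, service, state, detail) for each service not in `healthy`.

    Treating "backup" as healthy is an assumption, not something the tool states.
    """
    section = None
    problems = []
    for raw in status_text.splitlines():
        line = raw.strip()
        header = SECTION_RE.match(line)
        if header:
            section = header.group(1)
            continue
        service = SERVICE_RE.match(line)
        if service and service.group(2) not in healthy:
            problems.append(
                (section, service.group(1), service.group(2), service.group(3) or "")
            )
    return problems


# Excerpt of the listing above, used as sample input.
SAMPLE = """\
== Contrail analytics ==
snmp-collector: initializing (Database:Cassandra[] connection down)
query-engine: active
collector: initializing (Database:Cassandra connection down)
== Contrail config ==
api: initializing (Database:Cassandra[] connection down)
svc-monitor: backup
rabbitmq: active
"""

for section, service, state, detail in unhealthy_services(SAMPLE):
    print(f"{section} / {service}: {state} {detail}")
```

Running the same filter on each of the three nodes would show whether the Cassandra connection problem hits the same services everywhere.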
[root@node-10-1-56-124 ~]# free -hw
              total        used        free      shared     buffers       cache   available
Mem:            15G         11G        3.3G         28M          0B        892M        3.7G
Swap:            0B          0B          0B
Regards,
Andrey Pavlov.
On Tue, Apr 17, 2018 at 3:57 PM, Michael Henkel <email address hidden> wrote:
Pulkit,
How many resources did you assign to your instances?
Regards,
Michael
On 17.04.2018 at 05:37, Pulkit Tandon <email address hidden> wrote:
Hi All,
I need your help and expertise debugging the k8s sanity setup, which is in a really bad state. Things have been messier since build 15.
I observed multiple problems on the current attempt. I am not sure whether they are linked or all different.
I have kept the setup in the same state so that you can debug the failures live.
K8s HA Setup details:
3 Controller+kube managers:
10.204.217.52(nodeg12)
10.204.217.71(nodeg31)
10.204.217.98(nodec58)
2 Agents/ k8s slave:
10.204.217.100(nodec60)
10.204.217.101(nodec61)
Multi interface setup
Following are key observations:
1. RabbitMQ cluster formed between nodeg12 and nodeg31. Nodec58 has rabbitmq as inactive.
rabbitmq: inactive
Docker logs for rabbitmq container on nodec58:
{"init terminating in do_boot",{error,{inconsistent_cluster,"Node contrail@nodec58 thinks it's clustered with node contrail@nodeg31, but contrail@nodeg31 disagrees"}}}
2. On all 3 controllers, the Cassandra connection was not established for 2 hours after provisioning. The issue seems to flap over time, and sometimes I see the services as active too:
control: initializing (Database:Cassandra connection down)
collector: initializing (Database:Cassandra connection down)
3. When I create a k8s Pod, it often results in Pod creation failure and an immediate vrouter crash.
The trace is below.
Whether or not the crash happens, Pod creation fails.
NOTE: Most of the issues observed are on k8s HA multi interface setup.
Things are better with Non HA/ single interface setup.
Agent crash trace:
(gdb) bt full
#0 0x00007fb9817761f7 in raise () from /lib64/libc.so.6
No symbol table info available.
#1 0x00007fb9817778e8 in abort () from /lib64/libc.so.6
No symbol table info available.
#2 0x00007fb98176f266 in __assert_fail_base () from /lib64/libc.so.6
No symbol table info available.
#3 0x00007fb98176f312 in __assert_fail () from /lib64/libc.so.6
No symbol table info available.
#4 0x0000000000c15440 in AgentOperDBTable::ConfigEventHandler(IFMapNode*, DBEntry*) ()
No symbol table info available.
#5 0x0000000000c41714 in IFMapDependencyManager::ProcessChangeList() ()
No symbol table info available.
#6 0x0000000000ea4a57 in TaskTrigger::WorkerTask::Run() ()
No symbol table info available.
#7 0x0000000000e9e64f in TaskImpl::execute() ()
No symbol table info available.
#8 0x00007fb9823458ca in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all(tbb::task&, tbb::task*) () from /lib64/libtbb.so.2
No symbol table info available.
#9 0x00007fb9823415b6 in tbb::internal::arena::process(tbb::internal::generic_scheduler&) () from /lib64/libtbb.so.2
No symbol table info available.
#10 0x00007fb982340c8b in tbb::internal::market::process(rml::job&) () from /lib64/libtbb.so.2
No symbol table info available.
#11 0x00007fb98233e67f in tbb::internal::rml::private_worker::run() () from /lib64/libtbb.so.2
No symbol table info available.
#12 0x00007fb98233e879 in tbb::internal::rml::private_worker::thread_routine(void*) () from /lib64/libtbb.so.2
No symbol table info available.
#13 0x00007fb982560e25 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#14 0x00007fb98183934d in clone () from /lib64/libc.so.6
Hi Andrey, did you check nodetool status?
Regards,
Michael
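
For anyone following along, the `nodetool status` check can also be scripted once its output has been captured as text. This is a rough sketch only: the function name `nodes_not_up_normal` is hypothetical, and the sample output below is made up for illustration (documentation addresses, placeholder host IDs), not taken from either setup in this thread. It relies on the standard two-letter code at the start of each node line (U/D for up/down, N/L/J/M for normal/leaving/joining/moving).

```python
import re

# A node line starts with a two-letter status/state code followed by the address.
NODE_RE = re.compile(r"^([UD][NLJM])\s+(\S+)")


def nodes_not_up_normal(nodetool_output):
    """Return (address, code) for every node whose code is not "UN"."""
    bad = []
    for line in nodetool_output.splitlines():
        match = NODE_RE.match(line.strip())
        if match and match.group(1) != "UN":
            bad.append((match.group(2), match.group(1)))
    return bad


# Hypothetical sample output; addresses and host IDs are placeholders.
SAMPLE = """\
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load     Tokens  Owns (effective)  Host ID                               Rack
UN  192.0.2.11  1.2 MiB  256     100.0%            00000000-0000-0000-0000-000000000001  rack1
DN  192.0.2.12  1.1 MiB  256     100.0%            00000000-0000-0000-0000-000000000002  rack1
UJ  192.0.2.13  0.9 MiB  256     ?                 00000000-0000-0000-0000-000000000003  rack1
"""

for address, code in nodes_not_up_normal(SAMPLE):
    print(f"{address} is in state {code}")
```

An empty result would mean every listed node reports up and normal, which would point away from Cassandra ring membership as the cause of the "connection down" states.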
On 17.04.2018 at 06:04, Andrey Pavlov <email address hidden> wrote:
Hey Michael,
I have similar problems in my 3-node setup:
== Contrail control ==
control: active
nodemgr: active
named: active
dns: active
== Contrail analytics ==
snmp-collector: initializing (Database:Cassandra[] connection down)
query-engine: active
api: active
alarm-gen: initializing (Database:Cassandra[] connection down)
nodemgr: active
collector: initializing (Database:Cassandra connection down)
topology: initializing (Database:Cassandra[] connection down)
== Contrail config ==
api: initializing (Database:Cassandra[] connection down)
zookeeper: active
svc-monitor: backup
nodemgr: active
device-manager: backup
cassandra: active
rabbitmq: active
schema: backup
== Contrail webui ==
web: active
job: active
== Contrail database ==
kafka: active
nodemgr: active
zookeeper: active
cassandra: active
[root@node-10-1-56-124 ~]# free -hw
              total        used        free      shared     buffers       cache   available
Mem:            15G         11G        3.3G         28M          0B        892M        3.7G
Swap:            0B          0B          0B
Regards,
Andrey Pavlov.
On Tue, Apr 17, 2018 at 3:57 PM, Michael Henkel <email address hidden> wrote:
Pulkit,
How many resources did you assign to your instances?
Regards,
Michael
On 17.04.2018 at 05:37, Pulkit Tandon <email address hidden> wrote:
Hi All,
I need your help and expertise debugging the k8s sanity setup, which is in a really bad state. Things have been messier since build 15.
I observed multiple problems on the current attempt. I am not sure whether they are linked or all different.
I have kept the setup in the same state so that you can debug the failures live.
K8s HA Setup details:
3 Controller+kube managers:
10.204.217.52(nodeg12)
10.204.217.71(nodeg31)
10.204.217.98(nodec58)
2 Agents/ k8s slave:
10.204.217.100(nodec60)
10.204.217.101(nodec61)
Multi interface setup
Following are key observations:
1. RabbitMQ cluster formed between nodeg12 and nodeg31. Nodec58 has rabbitmq as inactive.
rabbitmq: inactive
Docker logs for rabbitmq container on nodec58:
{"init terminating in do_boot",{error,{inconsistent_cluster,"Node contrail@nodec58 thinks it's clustered with node contrail@nodeg31, but contrail@nodeg31 disagrees"}}}
2. On all 3 controllers, the Cassandra connection was not established for 2 hours after provisioning. The issue seems to flap over time, and sometimes I see the services as active too:
control: initializing (Database:Cassandra connection down)
collector: initializing (Database:Cassandra connection down)
3. When I create a k8s Pod, it often results in Pod creation failure and an immediate vrouter crash.
The trace is below.
Whether or not the crash happens, Pod creation fails.
4. On the CNI of both agents, I am seeing this error:
I : 24646 : 2018/04/17 17:35:44 vrouter.go:79: VRouter request. Operation : GET Url : http://127.0.0.1:9091/vm/7a271412-4237-11e8-8997-002590c55f6a
E : 24646 : 2018/04/17 17:35:44 vrouter.go:147: Failed HTTP Get operation. Return code 404
I : 24646 : 2018/04/17 17:35:44 vrouter.go:181: Iteration 14 : Get vrouter failed
E : 24633 : 2018/04/17 17:35:49 vrouter.go:287: Error in polling VRouter
I : 24633 : 2018/04/17 17:35:49 cni.go:175: Error in Add to VRouter
E : 24633 : 2018/04/17 17:35:49 contrail-kube-cni.go:67: Failed processing Add command.
E : 24646 : 2018/04/17 17:35:49 vrouter.go:287: Error in polling VRouter
I : 24646 : 2018/04/17 17:35:49 cni.go:175: Error in Add to VRouter
E : 24646 : 2018/04/17 17:35:49 contrail-kube-cni.go:67: Failed processing Add command.
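
To see at a glance which CNI invocations failed, log lines of this shape can be split into fields and the error lines counted per process ID. This is a minimal sketch under stated assumptions: the helper names `parse_cni_log` and `errors_by_pid` are hypothetical, and the regular expression is inferred only from the lines quoted in this mail (`level : pid : timestamp file:line: message`), not from the CNI source.

```python
import re
from collections import Counter

# Layout inferred from the quoted lines:
#   E : 24646 : 2018/04/17 17:35:44 vrouter.go:147: Failed HTTP Get operation. ...
LOG_RE = re.compile(
    r"^(?P<level>[A-Z]) : (?P<pid>\d+) : "
    r"(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<src>[\w.-]+):(?P<line>\d+): (?P<msg>.*)$"
)


def parse_cni_log(text):
    """Return one dict per line that matches the assumed layout; skip the rest."""
    entries = []
    for raw in text.splitlines():
        match = LOG_RE.match(raw.strip())
        if match:
            entries.append(match.groupdict())
    return entries


def errors_by_pid(entries):
    """Count the lines logged at level "E" for each process ID."""
    return Counter(e["pid"] for e in entries if e["level"] == "E")


# Four of the lines quoted above, used as sample input.
SAMPLE = """\
I : 24646 : 2018/04/17 17:35:44 vrouter.go:181: Iteration 14 : Get vrouter failed
E : 24646 : 2018/04/17 17:35:44 vrouter.go:147: Failed HTTP Get operation. Return code 404
E : 24633 : 2018/04/17 17:35:49 vrouter.go:287: Error in polling VRouter
E : 24633 : 2018/04/17 17:35:49 contrail-kube-cni.go:67: Failed processing Add command.
"""

entries = parse_cni_log(SAMPLE)
for pid, count in sorted(errors_by_pid(entries).items()):
    print(f"pid {pid}: {count} error line(s)")
```

Grouping by process ID keeps the two interleaved Add attempts (24633 and 24646) apart, which makes it easier to line each one up with the Pod it belongs to.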
NOTE: Most of the issues observed are on k8s HA multi interface setup.
Things are better with Non HA/ single interface setup.
Agent crash trace:
(gdb) bt full
#0 0x00007fb9817761f7 in raise () from /lib64/libc.so.6
No symbol table info available.
#1 0x00007fb9817778e8 in abort () from /lib64/libc.so.6
No symbol table info available.
#2 0x00007fb98176f266 in __assert_fail_base () from /lib64/libc.so.6
No symbol table info available.
#3 0x00007fb98176f312 in __assert_fail () from /lib64/libc.so.6
No symbol table info available.
#4 0x0000000000c15440 in AgentOperDBTable::ConfigEventHandler(IFMapNode*, DBEntry*) ()
No symbol table info available.
#5 0x0000000000c41714 in IFMapDependencyManager::ProcessChangeList() ()
No symbol table info available.
#6 0x0000000000ea4a57 in TaskTrigger::WorkerTask::Run() ()
No symbol table info available.
#7 0x0000000000e9e64f in TaskImpl::execute() ()
No symbol table info available.
#8 0x00007fb9823458ca in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all(tbb::task&, tbb::task*) () from /lib64/libtbb.so.2
No symbol table info available.
#9 0x00007fb9823415b6 in tbb::internal::arena::process(tbb::internal::generic_scheduler&) () from /lib64/libtbb.so.2
No symbol table info available.
#10 0x00007fb982340c8b in tbb::internal::market::process(rml::job&) () from /lib64/libtbb.so.2
No symbol table info available.
#11 0x00007fb98233e67f in tbb::internal::rml::private_worker::run() () from /lib64/libtbb.so.2
No symbol table info available.
#12 0x00007fb98233e879 in tbb::internal::rml::private_worker::thread_routine(void*) () from /lib64/libtbb.so.2
No symbol table info available.
#13 0x00007fb982560e25 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#14 0x00007fb98183934d in clone () from /lib64/libc.so.6
Thanks!
Pulkit Tandon