Comment 6 for bug 1764493

Andrey Pavlov (apavlov-e) wrote : Re: Debugging required on k8s sanity setup which failed for R5.0-16

Hey Michael,

I am seeing similar problems in my 3-node setup:

== Contrail control ==
control: active
nodemgr: active
named: active
dns: active

== Contrail analytics ==
snmp-collector: initializing (Database:Cassandra[] connection down)
query-engine: active
api: active
alarm-gen: initializing (Database:Cassandra[] connection down)
nodemgr: active
collector: initializing (Database:Cassandra connection down)
topology: initializing (Database:Cassandra[] connection down)

== Contrail config ==
api: initializing (Database:Cassandra[] connection down)
zookeeper: active
svc-monitor: backup
nodemgr: active
device-manager: backup
cassandra: active
rabbitmq: active
schema: backup

== Contrail webui ==
web: active
job: active

== Contrail database ==
kafka: active
nodemgr: active
zookeeper: active
cassandra: active

[root@node-10-1-56-124 ~]# free -hw
              total        used        free      shared     buffers       cache   available
Mem:            15G         11G        3.3G         28M          0B        892M        3.7G
Swap:            0B          0B          0B
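
In case it helps, this is how I checked whether Cassandra itself is reachable on my nodes (the container name is whatever docker ps shows for the cassandra container in your deployment, and 9042/9160 are the stock Cassandra ports, so adjust both for the contrail images):

# confirm Cassandra is listening at all
ss -lntp | grep -E ':9042|:9160'
# check ring/cluster state from inside the container
docker exec <cassandra-container> nodetool status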

Regards,
Andrey Pavlov.

On Tue, Apr 17, 2018 at 3:57 PM, Michael Henkel <email address hidden> wrote:

> Pulkit,
>
> How many resources did you assign to your instances?
>
> Regards,
> Michael
>
> On 17.04.2018 at 05:37, Pulkit Tandon <email address hidden> wrote:
>
> Hi All,
>
>
>
> I need your help and expertise debugging the k8s sanity setup, which is in
> a really bad state. Things have become messier starting with build 15.
>
> I observed multiple problems on the current attempt. I am not sure whether
> they are linked or independent.
>
> I have kept the setup intact so that you can debug the failures on the
> live setup.
>
>
>
> *K8s HA Setup details:*
>
> 3 Controller+kube managers:
>
> 10.204.217.52(nodeg12)
>
> 10.204.217.71(nodeg31)
>
> 10.204.217.98(nodec58)
>
> 2 Agents/ k8s slave:
>
> 10.204.217.100(nodec60)
>
> 10.204.217.101(nodec61)
>
> Multi interface setup
>
>
>
> Following are key observations:
>
> 1. The RabbitMQ cluster formed only between nodeg12 and nodeg31; on
> nodec58, rabbitmq is inactive.
>
> rabbitmq: inactive
>
> Docker logs for rabbitmq container on nodec58:
>
> {"init terminating in do_boot",{error,{inconsistent_cluster,"Node
> contrail@nodec58 thinks it's clustered with node contrail@nodeg31, but
> contrail@nodeg31 disagrees"}}}
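>
> A possible recovery, assuming the standard rabbitmqctl tooling is available
> inside the rabbitmq container on nodec58 (the container name below is a
> placeholder), is to reset the out-of-sync node so that it rejoins the cluster:
>
> docker exec <rabbitmq-container> rabbitmqctl stop_app
> docker exec <rabbitmq-container> rabbitmqctl reset
> docker exec <rabbitmq-container> rabbitmqctl start_app
> docker exec <rabbitmq-container> rabbitmqctl cluster_status
>
> rabbitmqctl reset wipes the node's local Mnesia state, which is what the
> inconsistent_cluster error is complaining about.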
>
>
>
> 2. On all 3 controllers, the Cassandra connection was not established for 2
> hours after provisioning. The issue flaps over time, and sometimes I see
> the services as active too:
> control: initializing (Database:Cassandra connection down)
> collector: initializing (Database:Cassandra connection down)
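>
> Since the state flaps, a simple way to catch when the connection recovers
> (contrail-status is the standard status tool on these nodes; the 30s
> interval is arbitrary) is:
>
> watch -n 30 'date; contrail-status | grep -i cassandra'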
>
>
>
> 3. If I create a k8s pod, it frequently results in pod creation failure,
> and the vrouter crashes immediately. The trace is below.
> Whether or not the crash happens, pod creation fails.
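>
> To reproduce and capture the failure, something like this (the pod name and
> image are placeholders) shows the CNI error that kubelet records as events:
>
> kubectl run cni-test --image=busybox --restart=Never -- sleep 3600
> kubectl describe pod cni-test
> kubectl get events --sort-by=.metadata.creationTimestamp | tail -20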
>
>
>
> 4. On the CNI of both agents, I am seeing this error:
> I : 24646 : 2018/04/17 17:35:44 vrouter.go:79: VRouter request. Operation
> : GET Url : http://127.0.0.1:9091/vm/7a271412-4237-11e8-8997-002590c55f6a
>
> E : 24646 : 2018/04/17 17:35:44 vrouter.go:147: Failed HTTP Get operation.
> Return code 404
>
> I : 24646 : 2018/04/17 17:35:44 vrouter.go:181: Iteration 14 : Get vrouter
> failed
>
> E : 24633 : 2018/04/17 17:35:49 vrouter.go:287: Error in polling VRouter
>
> I : 24633 : 2018/04/17 17:35:49 cni.go:175: Error in Add to VRouter
>
> E : 24633 : 2018/04/17 17:35:49 contrail-kube-cni.go:67: Failed processing
> Add command.
>
> E : 24646 : 2018/04/17 17:35:49 vrouter.go:287: Error in polling VRouter
>
> I : 24646 : 2018/04/17 17:35:49 cni.go:175: Error in Add to VRouter
>
> E : 24646 : 2018/04/17 17:35:49 contrail-kube-cni.go:67: Failed processing
> Add command.
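>
> The 404 suggests the agent never learned about this VM/port. The same
> request can be replayed by hand against the agent's CNI endpoint (the UUID
> is the one from the log above):
>
> curl -v http://127.0.0.1:9091/vm/7a271412-4237-11e8-8997-002590c55f6a
>
> If that keeps returning 404, the agent most likely never received the
> port/VMI configuration for that pod.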
>
>
>
> *NOTE: Most of the issues are observed on the k8s HA multi-interface setup.*
>
> *Things are better with the non-HA / single-interface setup.*
>
>
>
>
>
> Agent crash trace:
>
> (gdb) bt full
>
> #0 0x00007fb9817761f7 in raise () from /lib64/libc.so.6
>
> No symbol table info available.
>
> #1 0x00007fb9817778e8 in abort () from /lib64/libc.so.6
>
> No symbol table info available.
>
> #2 0x00007fb98176f266 in __assert_fail_base () from /lib64/libc.so.6
>
> No symbol table info available.
>
> #3 0x00007fb98176f312 in __assert_fail () from /lib64/libc.so.6
>
> No symbol table info available.
>
> #4 0x0000000000c15440 in AgentOperDBTable::ConfigEventHandler(IFMapNode*,
> DBEntry*) ()
>
> No symbol table info available.
>
> #5 0x0000000000c41714 in IFMapDependencyManager::ProcessChangeList() ()
>
> No symbol table info available.
>
> #6 0x0000000000ea4a57 in TaskTrigger::WorkerTask::Run() ()
>
> No symbol table info available.
>
> #7 0x0000000000e9e64f in TaskImpl::execute() ()
>
> No symbol table info available.
>
> #8 0x00007fb9823458ca in tbb::internal::custom_scheduler<tbb::internal::
> IntelSchedulerTraits>::local_wait_for_all(tbb::task&, tbb::task*) () from
> /lib64/libtbb.so.2
>
> No symbol table info available.
>
> #9 0x00007fb9823415b6 in tbb::internal::arena::process(
> tbb::internal::generic_scheduler&) () from /lib64/libtbb.so.2
>
> No symbol table info available.
>
> #10 0x00007fb982340c8b in tbb::internal::market::process(rml::job&) ()
> from /lib64/libtbb.so.2
>
> No symbol table info available.
>
> #11 0x00007fb98233e67f in tbb::internal::rml::private_worker::run() ()
> from /lib64/libtbb.so.2
>
> No symbol table info available.
>
> #12 0x00007fb98233e879 in tbb::internal::rml::private_worker::thread_routine(void*)
> () from /lib64/libtbb.so.2
>
> No symbol table info available.
>
> #13 0x00007fb982560e25 in start_thread () from /lib64/libpthread.so.0
>
> No symbol table info available.
>
> #14 0x00007fb98183934d in clone () from /lib64/libc.so.6
>
>
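> None of the frames have symbol info. If a core file was written (Contrail
> nodes typically keep cores under /var/crashes, but that path and the
> debuginfo package name below are assumptions for this image), a symbolized
> trace could be pulled from inside the vrouter agent container with:
>
> yum install -y gdb contrail-vrouter-agent-debuginfo
> gdb /usr/bin/contrail-vrouter-agent /var/crashes/core.<pid>
> (gdb) thread apply all bt full
>
> That should at least show which assert in
> AgentOperDBTable::ConfigEventHandler() is firing.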
>
>
>
> Thanks!
>
> Pulkit Tandon
>
>
>
>