Comment 12 for bug 1764493

Andrey Pavlov (apavlov-e) wrote : Re: Debugging required on k8s sanity setup which failed for R5.0-16

Alexey added JVM_EXTRA_OPTS to the cassandra container here:
https://review.opencontrail.org/#/c/41928/1/containers/external/cassandra/contrail-entrypoint.sh
Now I'm checking it this way:
https://github.com/cloudscaling/juniper-ci/blob/master/contrail-containers/ansible/instances.yaml.tmpl#L69
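
For reference, a minimal sketch of the kind of setting involved (heap sizes here are purely illustrative, and the exact plumbing depends on how the entrypoint forwards the variable to cassandra-env.sh):

  # stock cassandra-env.sh appends JVM_EXTRA_OPTS to JVM_OPTS,
  # so a fixed heap can be requested via the container environment
  export JVM_EXTRA_OPTS="-Xms2g -Xmx2g"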

Regards,
Andrey Pavlov.

On Tue, Apr 17, 2018 at 6:02 PM, Michael Henkel <email address hidden> wrote:

> And since then we have the cassandra problems? The symptoms clearly point
> towards memory shortage.
> We have to expose the heap size as a parameter, otherwise Java is running
> crazy.
> Regards,
> Michael
>
> > On Apr 17, 2018, at 7:21 AM, Andrey Pavlov <email address hidden> wrote:
> >
> > btw, memory change for cassandra was merged recently -
> > https://review.opencontrail.org/#/c/41767/1/containers/external/cassandra/contrail-entrypoint.sh
> >
> > Regards,
> > Andrey Pavlov.
> >
> > On Tue, Apr 17, 2018 at 4:19 PM, Andrey Pavlov <email address hidden> wrote:
> > root@node-10-1-56-124:/# nodetool -p 7200 status
> > Datacenter: datacenter1
> > =======================
> > Status=Up/Down
> > |/ State=Normal/Leaving/Joining/Moving
> > --  Address      Load      Tokens  Owns (effective)  Host ID                               Rack
> > UN  10.1.56.125  3.11 MiB  256     68.5%             468a1809-53ee-4242-971f-3015ccedc6c2  rack1
> > UN  10.1.56.124  1.89 MiB  256     72.2%             9aa41a48-3e9c-417d-b25c-7abf5e1f94aa  rack1
> > UN  10.1.56.126  3.63 MiB  256     59.3%             33e498c9-f3e2-4430-86b4-261b0ffbaa0e  rack1
> >
> > root@node-10-1-56-124:/# nodetool -p 7200 statusgossip
> > running
> > root@node-10-1-56-124:/# nodetool -p 7200 statusthrift
> > running
> > root@node-10-1-56-124:/# nodetool -p 7200 statusbinary
> > running
> >
> >
> > Regards,
> > Andrey Pavlov.
> >
> > On Tue, Apr 17, 2018 at 4:16 PM, Michael Henkel <email address hidden> wrote:
> > Hi Andrey, did you check nodetool status?
> >
> > Regards,
> > Michael
> >
> > On 17.04.2018 at 06:04, Andrey Pavlov <email address hidden> wrote:
> >
> >> Hey Michael,
> >>
> >> I have similar problems in my 3-node setup:
> >>
> >> == Contrail control ==
> >> control: active
> >> nodemgr: active
> >> named: active
> >> dns: active
> >>
> >> == Contrail analytics ==
> >> snmp-collector: initializing (Database:Cassandra[] connection down)
> >> query-engine: active
> >> api: active
> >> alarm-gen: initializing (Database:Cassandra[] connection down)
> >> nodemgr: active
> >> collector: initializing (Database:Cassandra connection down)
> >> topology: initializing (Database:Cassandra[] connection down)
> >>
> >> == Contrail config ==
> >> api: initializing (Database:Cassandra[] connection down)
> >> zookeeper: active
> >> svc-monitor: backup
> >> nodemgr: active
> >> device-manager: backup
> >> cassandra: active
> >> rabbitmq: active
> >> schema: backup
> >>
> >> == Contrail webui ==
> >> web: active
> >> job: active
> >>
> >> == Contrail database ==
> >> kafka: active
> >> nodemgr: active
> >> zookeeper: active
> >> cassandra: active
> >>
> >> [root@node-10-1-56-124 ~]# free -hw
> >>        total   used   free  shared  buffers  cache  available
> >> Mem:     15G    11G   3.3G     28M       0B   892M       3.7G
> >> Swap:     0B     0B     0B
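> >>
> >> (To see where the memory actually goes, a generic check such as
> >>
> >>   ps -eo pid,rss,comm --sort=-rss | head
> >>
> >> on the node would show whether the java/cassandra processes dominate; this is only a suggested check, not output captured from this setup.)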
> >>
> >>
> >> Regards,
> >> Andrey Pavlov.
> >>
> >> On Tue, Apr 17, 2018 at 3:57 PM, Michael Henkel <email address hidden> wrote:
> >> Pulkit,
> >>
> >> How many resources did you assign to your instances?
> >>
> >> Regards,
> >> Michael
> >>
> >> On 17.04.2018 at 05:37, Pulkit Tandon <email address hidden> wrote:
> >>
> >>> Hi All,
> >>>
> >>>
> >>>
> >>> I need your help and expertise debugging the k8s sanity setup, which is in a really bad state. Things have been getting messier starting with build 15.
> >>>
> >>> I observed multiple problems on the current attempt. Not sure whether they are linked or all different issues.
> >>>
> >>> I have kept the setup in the same state so that you can debug the failures on a live setup.
> >>>
> >>>
> >>>
> >>> K8s HA Setup details:
> >>>
> >>> 3 Controller + kube-manager nodes:
> >>>
> >>> 10.204.217.52(nodeg12)
> >>>
> >>> 10.204.217.71(nodeg31)
> >>>
> >>> 10.204.217.98(nodec58)
> >>>
> >>> 2 Agent / k8s slave nodes:
> >>>
> >>> 10.204.217.100(nodec60)
> >>>
> >>> 10.204.217.101(nodec61)
> >>>
> >>> Multi-interface setup
> >>>
> >>>
> >>>
> >>> Following are key observations:
> >>>
> >>> 1. RabbitMQ cluster formed between nodeg12 and nodeg31. nodec58 has rabbitmq as inactive.
> >>>
> >>> rabbitmq: inactive
> >>>
> >>> Docker logs for the rabbitmq container on nodec58:
> >>>
> >>> {"init terminating in do_boot",{error,{inconsistent_cluster,"Node contrail@nodec58 thinks it's clustered with node contrail@nodeg31, but contrail@nodeg31 disagrees"}}}
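> >>>
> >>> (A common generic recovery for this kind of inconsistent_cluster error is to reset the out-of-sync node and re-join it; the node names below are taken from the error above and would need verifying before running anything:
> >>>
> >>>   rabbitmqctl stop_app
> >>>   rabbitmqctl force_reset
> >>>   rabbitmqctl join_cluster contrail@nodeg31
> >>>   rabbitmqctl start_app
> >>>
> >>> This is a standard rabbitmqctl sequence, not something already tried here.)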
> >>>
> >>>
> >>>
> >>> 2. On all 3 controllers, the Cassandra connection was not established for 2 hours after provisioning. This issue seems to flap with time, and sometimes I see the services as active too:
> >>> control: initializing (Database:Cassandra connection down)
> >>> collector: initializing (Database:Cassandra connection down)
> >>>
> >>>
> >>>
> >>> 3. If I create a k8s pod, many times it results in pod creation failure, and a vrouter crash happens instantly.
> >>> The trace is below.
> >>> Irrespective of whether the crash happens or not, pod creation fails.
> >>>
> >>>
> >>>
> >>> 4. On the CNI of both agents, seeing this error:
> >>> I : 24646 : 2018/04/17 17:35:44 vrouter.go:79: VRouter request. Operation : GET Url : http://127.0.0.1:9091/vm/7a271412-4237-11e8-8997-002590c55f6a
> >>>
> >>> E : 24646 : 2018/04/17 17:35:44 vrouter.go:147: Failed HTTP Get operation. Return code 404
> >>>
> >>> I : 24646 : 2018/04/17 17:35:44 vrouter.go:181: Iteration 14 : Get vrouter failed
> >>>
> >>> E : 24633 : 2018/04/17 17:35:49 vrouter.go:287: Error in polling VRouter
> >>>
> >>> I : 24633 : 2018/04/17 17:35:49 cni.go:175: Error in Add to VRouter
> >>>
> >>> E : 24633 : 2018/04/17 17:35:49 contrail-kube-cni.go:67: Failed processing Add command.
> >>>
> >>> E : 24646 : 2018/04/17 17:35:49 vrouter.go:287: Error in polling VRouter
> >>>
> >>> I : 24646 : 2018/04/17 17:35:49 cni.go:175: Error in Add to VRouter
> >>>
> >>> E : 24646 : 2018/04/17 17:35:49 contrail-kube-cni.go:67: Failed processing Add command.
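> >>>
> >>> (The 404s suggest the agent never received the port configuration for that VM UUID; the same endpoint the CNI polls can be re-checked by hand on the agent node, e.g.
> >>>
> >>>   curl -i http://127.0.0.1:9091/vm/7a271412-4237-11e8-8997-002590c55f6a
> >>>
> >>> which simply re-issues the GET shown in the log above.)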
> >>>
> >>>
> >>>
> >>> NOTE: Most of the issues observed are on the k8s HA multi-interface setup.
> >>>
> >>> Things are better with the non-HA / single-interface setup.
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> Agent crash trace:
> >>>
> >>> (gdb) bt full
> >>>
> >>> #0 0x00007fb9817761f7 in raise () from /lib64/libc.so.6
> >>>
> >>> No symbol table info available.
> >>>
> >>> #1 0x00007fb9817778e8 in abort () from /lib64/libc.so.6
> >>>
> >>> No symbol table info available.
> >>>
> >>> #2 0x00007fb98176f266 in __assert_fail_base () from /lib64/libc.so.6
> >>>
> >>> No symbol table info available.
> >>>
> >>> #3 0x00007fb98176f312 in __assert_fail () from /lib64/libc.so.6
> >>>
> >>> No symbol table info available.
> >>>
> >>> #4 0x0000000000c15440 in AgentOperDBTable::ConfigEventHandler(IFMapNode*, DBEntry*) ()
> >>>
> >>> No symbol table info available.
> >>>
> >>> #5 0x0000000000c41714 in IFMapDependencyManager::ProcessChangeList() ()
> >>>
> >>> No symbol table info available.
> >>>
> >>> #6 0x0000000000ea4a57 in TaskTrigger::WorkerTask::Run() ()
> >>>
> >>> No symbol table info available.
> >>>
> >>> #7 0x0000000000e9e64f in TaskImpl::execute() ()
> >>>
> >>> No symbol table info available.
> >>>
> >>> #8 0x00007fb9823458ca in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all(tbb::task&, tbb::task*) () from /lib64/libtbb.so.2
> >>>
> >>> No symbol table info available.
> >>>
> >>> #9 0x00007fb9823415b6 in tbb::internal::arena::process(tbb::internal::generic_scheduler&) () from /lib64/libtbb.so.2
> >>>
> >>> No symbol table info available.
> >>>
> >>> #10 0x00007fb982340c8b in tbb::internal::market::process(rml::job&) () from /lib64/libtbb.so.2
> >>>
> >>> No symbol table info available.
> >>>
> >>> #11 0x00007fb98233e67f in tbb::internal::rml::private_worker::run() () from /lib64/libtbb.so.2
> >>>
> >>> No symbol table info available.
> >>>
> >>> #12 0x00007fb98233e879 in tbb::internal::rml::private_worker::thread_routine(void*) () from /lib64/libtbb.so.2
> >>>
> >>> No symbol table info available.
> >>>
> >>> #13 0x00007fb982560e25 in start_thread () from /lib64/libpthread.so.0
> >>>
> >>> No symbol table info available.
> >>>
> >>> #14 0x00007fb98183934d in clone () from /lib64/libc.so.6
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> Thanks!
> >>>
> >>> Pulkit Tandon
> >>>
> >>>
> >>>
> >>
> >
> >
>
>