Comment 15 for bug 1764493

Andrey Pavlov (apavlov-e) wrote : Re: Debugging required on k8s sanity setup which failed for R5.0-16

Michael, that helped for me.

Regards,
Andrey.

Tue, Apr 17, 2018, 18:41 Michael Henkel <email address hidden>:

> ok, let me know how it goes.
> Regards,
> Michael
>
> > On Apr 17, 2018, at 8:05 AM, Andrey Pavlov <email address hidden> wrote:
> >
> > Alexey added JVM_EXTRA_OPTS to cassandra's container here:
> > https://review.opencontrail.org/#/c/41928/1/containers/external/cassandra/contrail-entrypoint.sh
> > Now I'm testing it this way:
> > https://github.com/cloudscaling/juniper-ci/blob/master/contrail-containers/ansible/instances.yaml.tmpl#L69
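> > For reference, a minimal sketch of the idea in instances.yaml terms
> > (the key placement is assumed from the template linked above, and the
> > 1g/2g numbers are examples, not tuned values):
> >
> >   contrail_configuration:
> >     # cap the Cassandra JVM heap so it cannot eat the whole host
> >     JVM_EXTRA_OPTS: "-Xms1g -Xmx2g"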
> >
> >
> > Regards,
> > Andrey Pavlov.
> >
> > On Tue, Apr 17, 2018 at 6:02 PM, Michael Henkel <email address hidden>
> wrote:
> > And since then we have had the cassandra problems? The symptoms
> > clearly point towards a memory shortage.
> > We have to expose the heap size as a parameter, otherwise Java runs
> > wild.
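> > A minimal sketch of the knob I mean, using the standard
> > cassandra-env.sh variables (values are examples only, not tuned):
> >
> >   # exported into the container environment, or set in cassandra-env.sh
> >   MAX_HEAP_SIZE="2G"   # hard cap for the JVM heap
> >   HEAP_NEWSIZE="512M"  # young-generation size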
> > Regards,
> > Michael
> >
> > > On Apr 17, 2018, at 7:21 AM, Andrey Pavlov <email address hidden>
> wrote:
> > >
> > > btw, memory change for cassandra was merged recently -
> https://review.opencontrail.org/#/c/41767/1/containers/external/cassandra/contrail-entrypoint.sh
> > >
> > > Regards,
> > > Andrey Pavlov.
> > >
> > > On Tue, Apr 17, 2018 at 4:19 PM, Andrey Pavlov <email address hidden>
> wrote:
> > > root@node-10-1-56-124:/# nodetool -p 7200 status
> > > Datacenter: datacenter1
> > > =======================
> > > Status=Up/Down
> > > |/ State=Normal/Leaving/Joining/Moving
> > > --  Address      Load      Tokens  Owns (effective)  Host ID                               Rack
> > > UN  10.1.56.125  3.11 MiB  256     68.5%             468a1809-53ee-4242-971f-3015ccedc6c2  rack1
> > > UN  10.1.56.124  1.89 MiB  256     72.2%             9aa41a48-3e9c-417d-b25c-7abf5e1f94aa  rack1
> > > UN  10.1.56.126  3.63 MiB  256     59.3%             33e498c9-f3e2-4430-86b4-261b0ffbaa0e  rack1
> > >
> > > root@node-10-1-56-124:/# nodetool -p 7200 statusgossip
> > > running
> > > root@node-10-1-56-124:/# nodetool -p 7200 statusthrift
> > > running
> > > root@node-10-1-56-124:/# nodetool -p 7200 statusbinary
> > > running
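> > >
> > > Since memory is the suspect here, the same JMX port can also report
> > > actual heap usage (standard nodetool subcommand):
> > >
> > > root@node-10-1-56-124:/# nodetool -p 7200 info | grep -i heap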
> > >
> > >
> > > Regards,
> > > Andrey Pavlov.
> > >
> > > On Tue, Apr 17, 2018 at 4:16 PM, Michael Henkel <email address hidden>
> wrote:
> > > Hi Andrey, did you check nodetool status?
> > >
> > > Regards,
> > > Michael
> > >
> > > On 17.04.2018 at 06:04, Andrey Pavlov <email address hidden> wrote:
> > >
> > >> Hey Michael,
> > >>
> > >> I have similar problems on my 3-node setup:
> > >>
> > >> == Contrail control ==
> > >> control: active
> > >> nodemgr: active
> > >> named: active
> > >> dns: active
> > >>
> > >> == Contrail analytics ==
> > >> snmp-collector: initializing (Database:Cassandra[] connection down)
> > >> query-engine: active
> > >> api: active
> > >> alarm-gen: initializing (Database:Cassandra[] connection down)
> > >> nodemgr: active
> > >> collector: initializing (Database:Cassandra connection down)
> > >> topology: initializing (Database:Cassandra[] connection down)
> > >>
> > >> == Contrail config ==
> > >> api: initializing (Database:Cassandra[] connection down)
> > >> zookeeper: active
> > >> svc-monitor: backup
> > >> nodemgr: active
> > >> device-manager: backup
> > >> cassandra: active
> > >> rabbitmq: active
> > >> schema: backup
> > >>
> > >> == Contrail webui ==
> > >> web: active
> > >> job: active
> > >>
> > >> == Contrail database ==
> > >> kafka: active
> > >> nodemgr: active
> > >> zookeeper: active
> > >> cassandra: active
> > >>
> > >> [root@node-10-1-56-124 ~]# free -hw
> > >>          total   used   free   shared  buffers  cache  available
> > >> Mem:       15G    11G   3.3G      28M       0B   892M       3.7G
> > >> Swap:       0B     0B     0B
> > >>
> > >>
> > >> Regards,
> > >> Andrey Pavlov.
> > >>
> > >> On Tue, Apr 17, 2018 at 3:57 PM, Michael Henkel <email address hidden>
> wrote:
> > >> Pulkit,
> > >>
> > >> How many resources did you assign to your instances?
> > >>
> > >> Regards,
> > >> Michael
> > >>
> > >> On 17.04.2018 at 05:37, Pulkit Tandon <email address hidden> wrote:
> > >>
> > >>> Hi All,
> > >>>
> > >>>
> > >>>
> > >>> I need your help and expertise debugging the k8s sanity setup, which
> > >>> is in a really bad state. Things have been messier starting with build 15.
> > >>>
> > >>> I observed multiple problems on the current attempt. Not sure whether
> > >>> they are linked or all separate.
> > >>>
> > >>> I have kept the setup in the same state so that you can debug the
> > >>> failures live.
> > >>>
> > >>>
> > >>>
> > >>> K8s HA Setup details:
> > >>>
> > >>> 3 Controller+kube managers:
> > >>>
> > >>> 10.204.217.52(nodeg12)
> > >>>
> > >>> 10.204.217.71(nodeg31)
> > >>>
> > >>> 10.204.217.98(nodec58)
> > >>>
> > >>> 2 Agents / k8s slaves:
> > >>>
> > >>> 10.204.217.100(nodec60)
> > >>>
> > >>> 10.204.217.101(nodec61)
> > >>>
> > >>> Multi-interface setup
> > >>>
> > >>>
> > >>>
> > >>> Following are the key observations:
> > >>>
> > >>> 1. The RabbitMQ cluster formed only between nodeg12 and nodeg31;
> > >>> nodec58 has rabbitmq as inactive.
> > >>>
> > >>> rabbitmq: inactive
> > >>>
> > >>> Docker logs for rabbitmq container on nodec58:
> > >>>
> > >>> {"init terminating in do_boot",{error,{inconsistent_cluster,"Node
> contrail@nodec58 thinks it's clustered with node contrail@nodeg31, but
> contrail@nodeg31 disagrees"}}}
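> > >>>
> > >>> A hedged recovery sketch for this state (standard rabbitmqctl
> > >>> commands, run inside the rabbitmq container on nodec58; not yet
> > >>> verified on this setup):
> > >>>
> > >>> rabbitmqctl stop_app
> > >>> rabbitmqctl reset                        # drop the stale cluster state
> > >>> rabbitmqctl join_cluster contrail@nodeg31
> > >>> rabbitmqctl start_app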
> > >>>
> > >>>
> > >>>
> > >>> 2. On all 3 controllers, the Cassandra connection was not
> > >>> established for 2 hours after provisioning. The issue flaps over
> > >>> time, and sometimes I see the services as active too:
> > >>> control: initializing (Database:Cassandra connection down)
> > >>> collector: initializing (Database:Cassandra connection down)
> > >>>
> > >>>
> > >>>
> > >>> 3. If I create a k8s pod, it often results in pod creation
> > >>> failure, and a vrouter crash follows immediately.
> > >>> The trace is below.
> > >>> Whether or not the crash happens, pod creation fails.
> > >>>
> > >>>
> > >>>
> > >>> 4. On the CNI of both agents, I am seeing this error:
> > >>> I : 24646 : 2018/04/17 17:35:44 vrouter.go:79: VRouter request.
> Operation : GET Url :
> http://127.0.0.1:9091/vm/7a271412-4237-11e8-8997-002590c55f6a
> > >>>
> > >>> E : 24646 : 2018/04/17 17:35:44 vrouter.go:147: Failed HTTP Get
> operation. Return code 404
> > >>>
> > >>> I : 24646 : 2018/04/17 17:35:44 vrouter.go:181: Iteration 14 : Get
> vrouter failed
> > >>>
> > >>> E : 24633 : 2018/04/17 17:35:49 vrouter.go:287: Error in polling
> VRouter
> > >>>
> > >>> I : 24633 : 2018/04/17 17:35:49 cni.go:175: Error in Add to VRouter
> > >>>
> > >>> E : 24633 : 2018/04/17 17:35:49 contrail-kube-cni.go:67: Failed
> processing Add command.
> > >>>
> > >>> E : 24646 : 2018/04/17 17:35:49 vrouter.go:287: Error in polling
> VRouter
> > >>>
> > >>> I : 24646 : 2018/04/17 17:35:49 cni.go:175: Error in Add to VRouter
> > >>>
> > >>> E : 24646 : 2018/04/17 17:35:49 contrail-kube-cni.go:67: Failed
> processing Add command.
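> > >>>
> > >>> The failing call can be replayed by hand against the vrouter agent's
> > >>> port endpoint (same URL and UUID as in the log above):
> > >>>
> > >>> curl -v http://127.0.0.1:9091/vm/7a271412-4237-11e8-8997-002590c55f6a
> > >>>
> > >>> A 404 here presumably means the agent never received the port-add for
> > >>> the pod, so the CNI polls until it gives up.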
> > >>>
> > >>>
> > >>>
> > >>> NOTE: Most of the issues observed are on the k8s HA multi-interface
> > >>> setup.
> > >>>
> > >>> Things are better with a non-HA / single-interface setup.
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> Agent crash trace:
> > >>>
> > >>> (gdb) bt full
> > >>>
> > >>> #0 0x00007fb9817761f7 in raise () from /lib64/libc.so.6
> > >>>
> > >>> No symbol table info available.
> > >>>
> > >>> #1 0x00007fb9817778e8 in abort () from /lib64/libc.so.6
> > >>>
> > >>> No symbol table info available.
> > >>>
> > >>> #2 0x00007fb98176f266 in __assert_fail_base () from /lib64/libc.so.6
> > >>>
> > >>> No symbol table info available.
> > >>>
> > >>> #3 0x00007fb98176f312 in __assert_fail () from /lib64/libc.so.6
> > >>>
> > >>> No symbol table info available.
> > >>>
> > >>> #4 0x0000000000c15440 in
> AgentOperDBTable::ConfigEventHandler(IFMapNode*, DBEntry*) ()
> > >>>
> > >>> No symbol table info available.
> > >>>
> > >>> #5 0x0000000000c41714 in
> IFMapDependencyManager::ProcessChangeList() ()
> > >>>
> > >>> No symbol table info available.
> > >>>
> > >>> #6 0x0000000000ea4a57 in TaskTrigger::WorkerTask::Run() ()
> > >>>
> > >>> No symbol table info available.
> > >>>
> > >>> #7 0x0000000000e9e64f in TaskImpl::execute() ()
> > >>>
> > >>> No symbol table info available.
> > >>>
> > >>> #8 0x00007fb9823458ca in
> tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all(tbb::task&,
> tbb::task*) () from /lib64/libtbb.so.2
> > >>>
> > >>> No symbol table info available.
> > >>>
> > >>> #9 0x00007fb9823415b6 in
> tbb::internal::arena::process(tbb::internal::generic_scheduler&) () from
> /lib64/libtbb.so.2
> > >>>
> > >>> No symbol table info available.
> > >>>
> > >>> #10 0x00007fb982340c8b in tbb::internal::market::process(rml::job&)
> () from /lib64/libtbb.so.2
> > >>>
> > >>> No symbol table info available.
> > >>>
> > >>> #11 0x00007fb98233e67f in tbb::internal::rml::private_worker::run()
> () from /lib64/libtbb.so.2
> > >>>
> > >>> No symbol table info available.
> > >>>
> > >>> #12 0x00007fb98233e879 in
> tbb::internal::rml::private_worker::thread_routine(void*) () from
> /lib64/libtbb.so.2
> > >>>
> > >>> No symbol table info available.
> > >>>
> > >>> #13 0x00007fb982560e25 in start_thread () from /lib64/libpthread.so.0
> > >>>
> > >>> No symbol table info available.
> > >>>
> > >>> #14 0x00007fb98183934d in clone () from /lib64/libc.so.6
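> > >>>
> > >>> The "No symbol table info available." lines just mean debug info for
> > >>> locals is missing. A hedged way to get readable frames on CentOS
> > >>> (binary path and debuginfo package name are assumed, not verified):
> > >>>
> > >>> debuginfo-install -y contrail-vrouter-agent
> > >>> gdb /usr/bin/contrail-vrouter-agent core.<pid> -ex 'bt full'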
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> Thanks!
> > >>>
> > >>> Pulkit Tandon
> > >>>
> > >>>
> > >>>
> > >>
> > >
> > >
> >
> >
>
>