Michael, that helped in my case.
Regards, Andrey.
On Tue, Apr 17, 2018 at 18:41, Michael Henkel <email address hidden> wrote:
> ok, let me know how it goes.
> Regards,
> Michael
>
> > On Apr 17, 2018, at 8:05 AM, Andrey Pavlov <email address hidden> wrote:
> >
> > Alexey added JVM_EXTRA_OPTS to cassandra's container here
> > https://review.opencontrail.org/#/c/41928/1/containers/external/cassandra/contrail-entrypoint.sh
> > Now I'm checking this way
> > https://github.com/cloudscaling/juniper-ci/blob/master/contrail-containers/ansible/instances.yaml.tmpl#L69
> >
> > Regards,
> > Andrey Pavlov.
> >
> > On Tue, Apr 17, 2018 at 6:02 PM, Michael Henkel <email address hidden> wrote:
> > And since then we have the cassandra problems? The symptoms clearly point towards memory shortage.
> > We have to expose the heap size as a parameter, otherwise Java goes crazy.
> > Regards,
> > Michael
> >
> > > On Apr 17, 2018, at 7:21 AM, Andrey Pavlov <email address hidden> wrote:
> > >
> > > btw, memory change for cassandra was merged recently -
> > > https://review.opencontrail.org/#/c/41767/1/containers/external/cassandra/contrail-entrypoint.sh
> > >
> > > Regards,
> > > Andrey Pavlov.
> > >
> > > On Tue, Apr 17, 2018 at 4:19 PM, Andrey Pavlov <email address hidden> wrote:
> > > root@node-10-1-56-124:/# nodetool -p 7200 status
> > > Datacenter: datacenter1
> > > =======================
> > > Status=Up/Down
> > > |/ State=Normal/Leaving/Joining/Moving
> > > --  Address      Load      Tokens  Owns (effective)  Host ID                               Rack
> > > UN  10.1.56.125  3.11 MiB  256     68.5%             468a1809-53ee-4242-971f-3015ccedc6c2  rack1
> > > UN  10.1.56.124  1.89 MiB  256     72.2%             9aa41a48-3e9c-417d-b25c-7abf5e1f94aa  rack1
> > > UN  10.1.56.126  3.63 MiB  256     59.3%             33e498c9-f3e2-4430-86b4-261b0ffbaa0e  rack1
> > >
> > > root@node-10-1-56-124:/# nodetool -p 7200 statusgossip
> > > running
> > > root@node-10-1-56-124:/# nodetool -p 7200 statusthrift
> > > running
> > > root@node-10-1-56-124:/# nodetool -p 7200 statusbinary
> > > running
> > >
> > > Regards,
> > > Andrey Pavlov.
> > >
> > > On Tue, Apr 17, 2018 at 4:16 PM, Michael Henkel <email address hidden> wrote:
> > > Hi Andrey, did you check nodetool status?
> > >
> > > Regards,
> > > Michael
> > >
> > > On 17.04.2018 at 06:04, Andrey Pavlov <email address hidden> wrote:
> > >
> > >> Hey Michael,
> > >>
> > >> I have similar problems in my 3-node setup:
> > >>
> > >> == Contrail control ==
> > >> control: active
> > >> nodemgr: active
> > >> named: active
> > >> dns: active
> > >>
> > >> == Contrail analytics ==
> > >> snmp-collector: initializing (Database:Cassandra[] connection down)
> > >> query-engine: active
> > >> api: active
> > >> alarm-gen: initializing (Database:Cassandra[] connection down)
> > >> nodemgr: active
> > >> collector: initializing (Database:Cassandra connection down)
> > >> topology: initializing (Database:Cassandra[] connection down)
> > >>
> > >> == Contrail config ==
> > >> api: initializing (Database:Cassandra[] connection down)
> > >> zookeeper: active
> > >> svc-monitor: backup
> > >> nodemgr: active
> > >> device-manager: backup
> > >> cassandra: active
> > >> rabbitmq: active
> > >> schema: backup
> > >>
> > >> == Contrail webui ==
> > >> web: active
> > >> job: active
> > >>
> > >> == Contrail database ==
> > >> kafka: active
> > >> nodemgr: active
> > >> zookeeper: active
> > >> cassandra: active
> > >>
> > >> [root@node-10-1-56-124 ~]# free -hw
> > >>        total  used  free  shared  buffers  cache  available
> > >> Mem:   15G    11G   3.3G  28M     0B       892M   3.7G
> > >> Swap:  0B     0B    0B
> > >>
> > >> Regards,
> > >> Andrey Pavlov.
> > >>
> > >> On Tue, Apr 17, 2018 at 3:57 PM, Michael Henkel <email address hidden> wrote:
> > >> Pulkit,
> > >>
> > >> How many resources did you assign to your instances?
> > >>
> > >> Regards,
> > >> Michael
> > >>
> > >> On 17.04.2018 at 05:37, Pulkit Tandon <email address hidden> wrote:
> > >>
> > >>> Hi All,
> > >>>
> > >>> I need your help and expertise debugging the k8s sanity setup, which is in a really bad state. Things have been messier since build 15.
> > >>> I observed multiple problems on the current attempt. Not sure whether they are linked or all different.
> > >>> I kept the setup in the same state so that you can debug the failures on the live setup.
> > >>>
> > >>> K8s HA Setup details:
> > >>> 3 Controller+kube managers:
> > >>> 10.204.217.52(nodeg12)
> > >>> 10.204.217.71(nodeg31)
> > >>> 10.204.217.98(nodec58)
> > >>> 2 Agents/ k8s slave:
> > >>> 10.204.217.100(nodec60)
> > >>> 10.204.217.101(nodec61)
> > >>> Multi interface setup
> > >>>
> > >>> Following are key observations:
> > >>>
> > >>> 1. A RabbitMQ cluster formed between nodeg12 and nodeg31. nodec58 shows rabbitmq as inactive.
> > >>> rabbitmq: inactive
> > >>> Docker logs for rabbitmq container on nodec58:
> > >>> {"init terminating in do_boot",{error,{inconsistent_cluster,"Node contrail@nodec58 thinks it's clustered with node contrail@nodeg31, but contrail@nodeg31 disagrees"}}}
> > >>>
> > >>> 2. On all 3 controllers, the Cassandra connection was not established for 2 hours after provisioning. The issue seems to flap over time, and sometimes I see the services as active too:
> > >>> control: initializing (Database:Cassandra connection down)
> > >>> collector: initializing (Database:Cassandra connection down)
> > >>>
> > >>> 3. If I create a k8s Pod, many times it results in a Pod creation failure and the vrouter crashes instantly.
> > >>> The trace is below.
> > >>> Whether or not the crash happens, Pod creation fails.
> > >>>
> > >>> 4. On the CNI of both agents, I am seeing this error:
> > >>> I : 24646 : 2018/04/17 17:35:44 vrouter.go:79: VRouter request. Operation : GET Url : http://127.0.0.1:9091/vm/7a271412-4237-11e8-8997-002590c55f6a
> > >>> E : 24646 : 2018/04/17 17:35:44 vrouter.go:147: Failed HTTP Get operation. Return code 404
> > >>> I : 24646 : 2018/04/17 17:35:44 vrouter.go:181: Iteration 14 : Get vrouter failed
> > >>> E : 24633 : 2018/04/17 17:35:49 vrouter.go:287: Error in polling VRouter
> > >>> I : 24633 : 2018/04/17 17:35:49 cni.go:175: Error in Add to VRouter
> > >>> E : 24633 : 2018/04/17 17:35:49 contrail-kube-cni.go:67: Failed processing Add command.
> > >>> E : 24646 : 2018/04/17 17:35:49 vrouter.go:287: Error in polling VRouter
> > >>> I : 24646 : 2018/04/17 17:35:49 cni.go:175: Error in Add to VRouter
> > >>> E : 24646 : 2018/04/17 17:35:49 contrail-kube-cni.go:67: Failed processing Add command.
> > >>>
> > >>> NOTE: Most of the issues observed are on the k8s HA multi interface setup.
> > >>> Things are better with the non-HA / single interface setup.
> > >>>
> > >>> Agent crash trace:
> > >>> (gdb) bt full
> > >>> #0 0x00007fb9817761f7 in raise () from /lib64/libc.so.6
> > >>> No symbol table info available.
> > >>> #1 0x00007fb9817778e8 in abort () from /lib64/libc.so.6
> > >>> No symbol table info available.
> > >>> #2 0x00007fb98176f266 in __assert_fail_base () from /lib64/libc.so.6
> > >>> No symbol table info available.
> > >>> #3 0x00007fb98176f312 in __assert_fail () from /lib64/libc.so.6
> > >>> No symbol table info available.
> > >>> #4 0x0000000000c15440 in AgentOperDBTable::ConfigEventHandler(IFMapNode*, DBEntry*) ()
> > >>> No symbol table info available.
> > >>> #5 0x0000000000c41714 in IFMapDependencyManager::ProcessChangeList() ()
> > >>> No symbol table info available.
> > >>> #6 0x0000000000ea4a57 in TaskTrigger::WorkerTask::Run() ()
> > >>> No symbol table info available.
> > >>> #7 0x0000000000e9e64f in TaskImpl::execute() ()
> > >>> No symbol table info available.
> > >>> #8 0x00007fb9823458ca in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all(tbb::task&, tbb::task*) () from /lib64/libtbb.so.2
> > >>> No symbol table info available.
> > >>> #9 0x00007fb9823415b6 in tbb::internal::arena::process(tbb::internal::generic_scheduler&) () from /lib64/libtbb.so.2
> > >>> No symbol table info available.
> > >>> #10 0x00007fb982340c8b in tbb::internal::market::process(rml::job&) () from /lib64/libtbb.so.2
> > >>> No symbol table info available.
> > >>> #11 0x00007fb98233e67f in tbb::internal::rml::private_worker::run() () from /lib64/libtbb.so.2
> > >>> No symbol table info available.
> > >>> #12 0x00007fb98233e879 in tbb::internal::rml::private_worker::thread_routine(void*) () from /lib64/libtbb.so.2
> > >>> No symbol table info available.
> > >>> #13 0x00007fb982560e25 in start_thread () from /lib64/libpthread.so.0
> > >>> No symbol table info available.
> > >>> #14 0x00007fb98183934d in clone () from /lib64/libc.so.6
> > >>>
> > >>> Thanks!
> > >>> Pulkit Tandon
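
For anyone comparing status dumps like the ones quoted in this thread, here is a minimal sketch that lists the services that are not healthy. It assumes the text has the shape pasted above ("== Section ==" headers followed by "name: state (detail)" lines). The names unhealthy_services, OK_STATES and SAMPLE are hypothetical helpers, not part of any Contrail tool, and treating "backup" as healthy is an assumption on my side.

```python
import re

# Section headers look like "== Contrail analytics ==".
SECTION_RE = re.compile(r"^==\s*(.+?)\s*==$")
# Service lines look like "collector: initializing (Database:Cassandra connection down)".
SERVICE_RE = re.compile(r"^([\w-]+):\s*(\w+)(?:\s*\((.*)\))?$")

# Assumption: "backup" is a normal state for the passive member of an HA pair.
OK_STATES = frozenset({"active", "backup"})


def unhealthy_services(status_text, ok_states=OK_STATES):
    """Return (section, service, state, detail) for each service not in ok_states."""
    section = None
    problems = []
    for raw in status_text.splitlines():
        line = raw.strip()
        header = SECTION_RE.match(line)
        if header:
            section = header.group(1)
            continue
        service = SERVICE_RE.match(line)
        if service and service.group(2) not in ok_states:
            problems.append(
                (section, service.group(1), service.group(2), service.group(3) or "")
            )
    return problems


# Shortened sample in the same shape as the status output quoted in the thread.
SAMPLE = """\
== Contrail control ==
control: active
nodemgr: active

== Contrail analytics ==
query-engine: active
collector: initializing (Database:Cassandra connection down)

== Contrail config ==
api: initializing (Database:Cassandra[] connection down)
svc-monitor: backup
"""

if __name__ == "__main__":
    for section, name, state, detail in unhealthy_services(SAMPLE):
        print(f"{section}: {name} is {state} ({detail})")
```

Running it before and after a change such as the heap-size setting gives a quick check of whether the "Cassandra connection down" entries have cleared.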