Alexey added JVM_EXTRA_OPTS to cassandra's container here: https://review.opencontrail.org/#/c/41928/1/containers/external/cassandra/contrail-entrypoint.sh
Now I'm checking it this way: https://github.com/cloudscaling/juniper-ci/blob/master/contrail-containers/ansible/instances.yaml.tmpl#L69
Regards, Andrey Pavlov.
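
A quick way to confirm the new knob actually takes effect, as a minimal sketch: it assumes the container is literally named cassandra and that the entrypoint appends JVM_EXTRA_OPTS to the java command line.

    # Container name is an assumption; adjust to the deployment.
    # Show the heap flags the running Cassandra JVM actually received.
    docker exec cassandra sh -c 'ps -ef | grep -o -- "-Xm[sx][^ ]*"'
    # If ps is missing in the image and java runs as PID 1:
    docker exec cassandra sh -c 'tr "\0" " " < /proc/1/cmdline; echo'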
On Tue, Apr 17, 2018 at 6:02 PM, Michael Henkel <email address hidden> wrote:
> And since then we have the cassandra problems? The symptoms clearly point
> towards memory shortage.
> We have to expose the heap size as a parameter, otherwise Java runs wild.
> Regards,
> Michael
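
For reference, exposing the heap as a parameter may not need custom plumbing if the entrypoint preserves stock behavior: cassandra-env.sh honors MAX_HEAP_SIZE and HEAP_NEWSIZE when both are set, instead of auto-sizing from host memory. A hedged sketch; the image tag and values are placeholders, not what contrail ships:

    # Sketch only: both variables must be set for cassandra-env.sh to use them.
    docker run -d --name cassandra \
      -e MAX_HEAP_SIZE=2G \
      -e HEAP_NEWSIZE=512M \
      cassandra:3.11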
>
> > On Apr 17, 2018, at 7:21 AM, Andrey Pavlov <email address hidden> wrote:
> >
> > btw, memory change for cassandra was merged recently -
> https://review.opencontrail.org/#/c/41767/1/containers/external/cassandra/contrail-entrypoint.sh
> >
> > Regards,
> > Andrey Pavlov.
> >
> > On Tue, Apr 17, 2018 at 4:19 PM, Andrey Pavlov <email address hidden>
> wrote:
> > root@node-10-1-56-124:/# nodetool -p 7200 status
> > Datacenter: datacenter1
> > =======================
> > Status=Up/Down
> > |/ State=Normal/Leaving/Joining/Moving
> > -- Address Load Tokens Owns (effective) Host ID Rack
> > UN 10.1.56.125 3.11 MiB 256 68.5% 468a1809-53ee-4242-971f-3015ccedc6c2 rack1
> > UN 10.1.56.124 1.89 MiB 256 72.2% 9aa41a48-3e9c-417d-b25c-7abf5e1f94aa rack1
> > UN 10.1.56.126 3.63 MiB 256 59.3% 33e498c9-f3e2-4430-86b4-261b0ffbaa0e rack1
> >
> > root@node-10-1-56-124:/# nodetool -p 7200 statusgossip
> > running
> > root@node-10-1-56-124:/# nodetool -p 7200 statusthrift
> > running
> > root@node-10-1-56-124:/# nodetool -p 7200 statusbinary
> > running
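
Since the symptoms point at memory, heap pressure can be read off the same JMX port; a small sketch:

    # Heap used vs. configured max (same non-default JMX port, 7200).
    nodetool -p 7200 info | grep -i heap
    # Dropped messages and blocked thread pools usually accompany GC thrash.
    nodetool -p 7200 tpstats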
> >
> >
> > Regards,
> > Andrey Pavlov.
> >
> > On Tue, Apr 17, 2018 at 4:16 PM, Michael Henkel <email address hidden>
> wrote:
> > Hi Andrey, did you check nodetool status?
> >
> > Regards,
> > Michael
> >
> > On 17.04.2018 at 06:04, Andrey Pavlov <email address hidden> wrote:
> >
> >> Hey Michael,
> >>
> >> I have similar problems in my 3-node setup:
> >>
> >> == Contrail control ==
> >> control: active
> >> nodemgr: active
> >> named: active
> >> dns: active
> >>
> >> == Contrail analytics ==
> >> snmp-collector: initializing (Database:Cassandra[] connection down)
> >> query-engine: active
> >> api: active
> >> alarm-gen: initializing (Database:Cassandra[] connection down)
> >> nodemgr: active
> >> collector: initializing (Database:Cassandra connection down)
> >> topology: initializing (Database:Cassandra[] connection down)
> >>
> >> == Contrail config ==
> >> api: initializing (Database:Cassandra[] connection down)
> >> zookeeper: active
> >> svc-monitor: backup
> >> nodemgr: active
> >> device-manager: backup
> >> cassandra: active
> >> rabbitmq: active
> >> schema: backup
> >>
> >> == Contrail webui ==
> >> web: active
> >> job: active
> >>
> >> == Contrail database ==
> >> kafka: active
> >> nodemgr: active
> >> zookeeper: active
> >> cassandra: active
> >>
> >> [root@node-10-1-56-124 ~]# free -hw
> >> total used free shared buffers cache available
> >> Mem: 15G 11G 3.3G 28M 0B 892M 3.7G
> >> Swap: 0B 0B 0B
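
free -hw only gives host totals; per-container figures would narrow down where the 11G goes. A sketch, assuming the services run under plain docker:

    # One-shot snapshot of per-container memory usage.
    docker stats --no-stream --format 'table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}'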
> >>
> >>
> >> Regards,
> >> Andrey Pavlov.
> >>
> >> On Tue, Apr 17, 2018 at 3:57 PM, Michael Henkel <email address hidden>
> wrote:
> >> Pulkit,
> >>
> >> How many resources did you assign to your instances?
> >>
> >> Regards,
> >> Michael
> >>
> >> On 17.04.2018 at 05:37, Pulkit Tandon <email address hidden> wrote:
> >>
> >>> Hi All,
> >>>
> >>>
> >>>
> >>> I need your help and expertise debugging the k8s sanity setup, which is
> in a really bad state. Things have been messier since build 15.
> >>>
> >>> I observed multiple problems on the current attempt. Not sure if they are
> linked or all different.
> >>>
> >>> I have kept the setup intact so that you can debug the failures on the
> live setup.
> >>>
> >>>
> >>>
> >>> K8s HA Setup details:
> >>>
> >>> 3 Controllers + kube managers:
> >>>
> >>> 10.204.217.52(nodeg12)
> >>>
> >>> 10.204.217.71(nodeg31)
> >>>
> >>> 10.204.217.98(nodec58)
> >>>
> >>> 2 Agents / k8s slaves:
> >>>
> >>> 10.204.217.100(nodec60)
> >>>
> >>> 10.204.217.101(nodec61)
> >>>
> >>> Multi-interface setup
> >>>
> >>>
> >>>
> >>> Following are key observations:
> >>>
> >>> 1. The RabbitMQ cluster formed between nodeg12 and nodeg31; nodec58
> has rabbitmq as inactive (a recovery sketch follows below).
> >>>
> >>> rabbitmq: inactive
> >>>
> >>> Docker logs for rabbitmq container on nodec58:
> >>>
> >>> {"init terminating in do_boot"
> contrail@nodec58 thinks it's clustered with node contrail@nodeg31, but
> contrail@nodeg31 disagrees"}}}
> >>>
> >>>
> >>>
> >>> 2. On all 3 controllers, the Cassandra connection was not established
> for 2 hours after provisioning. The issue flaps over time, and sometimes
> I see the services as active too:
> >>> control: initializing (Database:Cassandra connection down)
> >>> collector: initializing (Database:Cassandra connection down)
> >>>
> >>>
> >>>
> >>> 3. If I create a k8s pod, it often results in pod
> creation failure, and a vrouter crash happens instantly.
> >>> The trace is below.
> >>> Whether or not the crash happens, pod creation fails.
> >>>
> >>>
> >>>
> >>> 4. On the CNI of both agents, I am seeing this error (a replay sketch follows after this list):
> >>> I : 24646 : 2018/04/17 17:35:44 vrouter.go:79: VRouter request.
> Operation : GET Url : http://127.0.0.1:9091/vm/7a271412-4237-11e8-8997-002590c55f6a
> >>>
> >>> E : 24646 : 2018/04/17 17:35:44 vrouter.go:147: Failed HTTP Get
> operation. Return code 404
> >>>
> >>> I : 24646 : 2018/04/17 17:35:44 vrouter.go:181: Iteration 14 : Get
> vrouter failed
> >>>
> >>> E : 24633 : 2018/04/17 17:35:49 vrouter.go:287: Error in polling
> VRouter
> >>>
> >>> I : 24633 : 2018/04/17 17:35:49 cni.go:175: Error in Add to VRouter
> >>>
> >>> E : 24633 : 2018/04/17 17:35:49 contrail-kube-cni.go:67: Failed processing Add command.
> >>>
> >>> E : 24646 : 2018/04/17 17:35:49 vrouter.go:287: Error in polling
> VRouter
> >>>
> >>> I : 24646 : 2018/04/17 17:35:49 cni.go:175: Error in Add to VRouter
> >>>
> >>> E : 24646 : 2018/04/17 17:35:49 contrail-kube-cni.go:67: Failed processing Add command.
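
As referenced in observation 4, the CNI failure is a run of 404s from the agent's REST service on port 9091, so the same request can be replayed by hand to separate a CNI-side problem from an agent-side one. A sketch using the UUID from the log above:

    # Replay the CNI poll against the local vrouter agent on the worker node.
    # A persistent 404 means the agent never learned this pod's interface,
    # which points at the config path (Cassandra/RabbitMQ) rather than CNI.
    curl -v http://127.0.0.1:9091/vm/7a271412-4237-11e8-8997-002590c55f6a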
> >>>
> >>>
> >>>
> >>> NOTE: Most of the issues observed are on the k8s HA multi-interface setup.
> >>>
> >>> Things are better with the non-HA / single-interface setup.
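
For observation 1, the inconsistent_cluster error typically means nodec58 kept stale cluster state from an earlier provisioning run. One common recovery, hedged because reset wipes that node's local rabbit state, is to rejoin it explicitly:

    # Run on nodec58, inside the rabbitmq container if that is where it runs.
    # Add -n contrail@nodec58 if the local node name is not the default.
    rabbitmqctl stop_app
    rabbitmqctl reset            # discards the stale membership record
    rabbitmqctl join_cluster contrail@nodeg31
    rabbitmqctl start_app
    rabbitmqctl cluster_status   # all three nodes should now be listed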
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> Agent crash trace:
> >>>
> >>> (gdb) bt full
> >>>
> >>> #0 0x00007fb9817761f7 in raise () from /lib64/libc.so.6
> >>>
> >>> No symbol table info available.
> >>>
> >>> #1 0x00007fb9817778e8 in abort () from /lib64/libc.so.6
> >>>
> >>> No symbol table info available.
> >>>
> >>> #2 0x00007fb98176f266 in __assert_fail_base () from /lib64/libc.so.6
> >>>
> >>> No symbol table info available.
> >>>
> >>> #3 0x00007fb98176f312 in __assert_fail () from /lib64/libc.so.6
> >>>
> >>> No symbol table info available.
> >>>
> >>> #4 0x0000000000c15440 in AgentOperDBTable::ConfigEventHandler(IFMapNode*, DBEntry*) ()
> >>>
> >>> No symbol table info available.
> >>>
> >>> #5 0x0000000000c41714 in IFMapDependencyManager::ProcessChangeList() ()
> >>>
> >>> No symbol table info available.
> >>>
> >>> #6 0x0000000000ea4a57 in TaskTrigger::WorkerTask::Run() ()
> >>>
> >>> No symbol table info available.
> >>>
> >>> #7 0x0000000000e9e64f in TaskImpl::execute() ()
> >>>
> >>> No symbol table info available.
> >>>
> >>> #8 0x00007fb9823458ca in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all(tbb::task&, tbb::task*) () from /lib64/libtbb.so.2
> >>>
> >>> No symbol table info available.
> >>>
> >>> #9 0x00007fb9823415b6 in tbb::internal::arena::process(tbb::internal::generic_scheduler&) () from /lib64/libtbb.so.2
> >>>
> >>> No symbol table info available.
> >>>
> >>> #10 0x00007fb982340c8b in tbb::internal::market::process(rml::job&) () from /lib64/libtbb.so.2
> >>>
> >>> No symbol table info available.
> >>>
> >>> #11 0x00007fb98233e67f in tbb::internal::rml::private_worker::run() () from /lib64/libtbb.so.2
> >>>
> >>> No symbol table info available.
> >>>
> >>> #12 0x00007fb98233e879 in tbb::internal::rml::private_worker::thread_routine(void*) () from /lib64/libtbb.so.2
> >>>
> >>> No symbol table info available.
> >>>
> >>> #13 0x00007fb982560e25 in start_thread () from /lib64/libpthread.so.0
> >>>
> >>> No symbol table info available.
> >>>
> >>> #14 0x00007fb98183934d in clone () from /lib64/libc.so.6
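
Every frame above reports "No symbol table info available"; on a CentOS 7 host (assumed from the /lib64 paths) the trace becomes readable once debug symbols for the system libraries are installed, plus a debug build of the agent if one is published. A sketch; the binary and core paths are placeholders:

    # Requires yum-utils; package names are the usual CentOS 7 ones
    # and may differ in this environment.
    yum install -y yum-utils
    debuginfo-install -y glibc tbb
    # Re-open the core with symbols available:
    gdb /usr/bin/contrail-vrouter-agent /path/to/core -ex 'bt full'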
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> Thanks!
> >>>
> >>> Pulkit Tandon
> >>>
> >>>
> >>>
> >>
> >
> >
>
>