nodemgr stays in initializing state ( Cassandra state detected DOWN.)

Bug #1780948 reported by vimal on 2018-07-10
Affects                                        Status   Importance  Assigned to  Milestone
Juniper Openstack (status tracked in Trunk)
  R5.0                                         Invalid  Critical    vimal
  Trunk                                        Invalid  Critical    vimal

Bug Description

The database nodemgr stays in the initializing state (Cassandra state detected DOWN). A java.lang.OutOfMemoryError exception is seen in analytics_database_cassandra.

version
----------
ocata-5.0-134

commands
-----------

[root@nodem14 ~]#
[root@nodem14 ~]# contrail-status
Pod Service Original Name State Status
analytics alarm-gen contrail-analytics-alarm-gen running Up 5 hours
analytics api contrail-analytics-api running Up 5 hours
analytics collector contrail-analytics-collector running Up 5 hours
analytics nodemgr contrail-nodemgr running Up 5 hours
analytics query-engine contrail-analytics-query-engine running Up 5 hours
analytics snmp-collector contrail-analytics-snmp-collector running Up 5 hours
analytics topology contrail-analytics-topology running Up 5 hours
config api contrail-controller-config-api running Up About an hour
config cassandra contrail-external-cassandra running Up 5 hours
config device-manager contrail-controller-config-devicemgr running Up 5 hours
config nodemgr contrail-nodemgr running Up 5 hours
config rabbitmq contrail-external-rabbitmq running Up 5 hours
config schema contrail-controller-config-schema running Up 5 hours
config svc-monitor contrail-controller-config-svcmonitor running Up 5 hours
config zookeeper contrail-external-zookeeper running Up 5 hours
control control contrail-controller-control-control running Up 46 minutes
control dns contrail-controller-control-dns running Up 5 hours
control named contrail-controller-control-named running Up 5 hours
control nodemgr contrail-nodemgr running Up 5 hours
database cassandra contrail-external-cassandra running Up 5 hours
database kafka contrail-external-kafka running Up 5 hours
database nodemgr contrail-nodemgr running Up 5 hours
database zookeeper contrail-external-zookeeper running Up 5 hours
webui job contrail-controller-webui-job running Up 5 hours
webui web contrail-controller-webui-web running Up 5 hours

== Contrail control ==
control: active
nodemgr: active
named: active
dns: active

== Contrail analytics ==
snmp-collector: active
query-engine: active
api: active
alarm-gen: active
nodemgr: active
collector: active
topology: active

== Contrail config ==
api: active
zookeeper: active
svc-monitor: backup
nodemgr: active
device-manager: backup
cassandra: active
rabbitmq: active
schema: backup

== Contrail webui ==
web: active
job: active

== Contrail database ==
kafka: active
nodemgr: initializing (Cassandra state detected DOWN. )
zookeeper: active
cassandra: active
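
A minimal sketch for spotting the failing service in output like the above without reading every line. It assumes the summary layout shown in this report ("== Contrail <pod> ==" headers followed by "service: state" lines). The function name non_active_services is hypothetical and not part of contrail-status.

```python
import re


def non_active_services(status_text):
    """Return (pod, service, state) for every summary entry whose state
    is neither 'active' nor 'backup'."""
    problems = []
    pod = None
    for raw in status_text.splitlines():
        line = raw.strip()
        header = re.match(r"^== Contrail (.+?) ==$", line)
        if header:
            pod = header.group(1)
            continue
        # Ignore the per-container table and blank lines before any header.
        if pod is None or ":" not in line:
            continue
        service, state = line.split(":", 1)
        state = state.strip()
        # The first word is the state; the rest is an optional reason.
        if state.split(" ", 1)[0] not in ("active", "backup"):
            problems.append((pod, service.strip(), state))
    return problems


sample = """\
== Contrail database ==
kafka: active
nodemgr: initializing (Cassandra state detected DOWN. )
zookeeper: active
cassandra: active
"""

for pod, service, state in non_active_services(sample):
    print(f"{pod}/{service}: {state}")
```

On the sample, only the database nodemgr entry is reported.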

The following logs are seen in analytics_database_cassandra_1:

INFO [Service Thread] 2018-07-10 07:10:17,733 StatusLogger.java:51 - GossipStage 1 72 17282 0 0

WARN [ScheduledTasks:2] 2018-07-10 07:10:30,933 NoSpamLogger.java:94 - Some operations timed out, details available at debug level (debug.log)
INFO [Service Thread] 2018-07-10 07:10:32,541 StatusLogger.java:51 - SecondaryIndexManagement 0 0 0 0 0

INFO [HintsDispatcher:10] 2018-07-10 07:10:46,187 HintsDispatchExecutor.java:289 - Finished hinted handoff of file 89f61382-4bfa-4534-805b-6ddb9c5f06ba-1531200407225-1.hints to endpoint /10.204.216.95: 89f61382-4bfa-4534-805b-6ddb9c5f06ba, partially

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "MessagingService-Incoming-/10.204.216.96"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "RMI TCP Connection(idle)"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "RMI TCP Connection(idle)"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "RMI TCP Connection(idle)"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "RMI TCP Connection(idle)"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "MessagingService-Incoming-/10.204.216.95"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "RMI TCP Connection(idle)"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "RMI TCP Connection(idle)"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Reference-Reaper:1"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "RMI TCP Connection(idle)"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "MessagingService-Incoming-/10.204.216.95"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "RMI TCP Connection(idle)"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "RMI TCP Connection(idle)"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "RMI TCP Connection(idle)"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "RMI TCP Connection(idle)"
*** java.lang.instrument ASSERTION FAILED ***: "!errorOutstanding" with message can't create byte arrau at JPLISAgent.c line: 813
*** java.lang.instrument ASSERTION FAILED ***: "!errorOutstanding" with message can't create byte arrau at JPLISAgent.c line: 813

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "RMI TCP Connection(idle)"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "MessagingService-Incoming-/10.204.216.95"
Jul 10, 2018 7:37:12 AM sun.rmi.transport.tcp.TCPTransport$AcceptLoop executeAcceptLoop
WARNING: RMI TCP Accept-7200: accept loop for ServerSocket[addr=localhost/127.0.0.1,localport=7200] throws
java.lang.OutOfMemoryError: Java heap space

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "RMI TCP Connection(idle)"
Jul 10, 2018 7:39:44 AM sun.rmi.transport.tcp.TCPTransport$AcceptLoop executeAcceptLoop
WARNING: RMI TCP Accept-7200: accept loop for ServerSocket[addr=localhost/127.0.0.1,localport=7200] throws
java.lang.OutOfMemoryError: Java heap space

ERROR [BatchlogTasks:1] 2018-07-10 07:20:23,338 JVMStabilityInspector.java:74 - OutOfMemory error letting the JVM handle the error:
java.lang.OutOfMemoryError: Java heap space

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "RMI TCP Connection(idle)"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "RMI TCP Connection(idle)"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "GossipStage:1"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "RMI TCP Connection(idle)"
ERROR [Native-Transport-Requests-18] 2018-07-10 07:20:23,338 JVMStabilityInspector.java:74 - OutOfMemory error letting the JVM handle the error:
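
To see which threads are hitting the OutOfMemoryError most often, the repeated lines above can be tallied. This is a rough sketch that assumes the exact message format shown in this log. The helper name oom_threads is hypothetical.

```python
import re
from collections import Counter

# Matches the uncaught-exception lines exactly as they appear in this report.
OOM_RE = re.compile(
    r"java\.lang\.OutOfMemoryError thrown from the "
    r'UncaughtExceptionHandler in thread "([^"]+)"'
)


def oom_threads(log_text):
    """Count OutOfMemoryError reports per thread name."""
    return Counter(OOM_RE.findall(log_text))


sample_log = """\
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "MessagingService-Incoming-/10.204.216.96"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "RMI TCP Connection(idle)"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "RMI TCP Connection(idle)"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "MessagingService-Incoming-/10.204.216.95"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Reference-Reaper:1"
"""

for thread, count in oom_threads(sample_log).most_common():
    print(f"{count:3d}  {thread}")
```

Because the heap is exhausted for the whole JVM, the thread names mostly show who happened to allocate next, not who leaked, so the tally is only a starting point.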

logs
------

/cs-shared/bugs/1780948
[vappachan@nodem3 1780948]$ ls
contrail-analytics-nodemgr.log contrail-collector.log contrail-config-nodemgr.log contrail-database-nodemgr.log logs

Changed in juniperopenstack:
assignee: nobody → Sundaresan Rajangam (srajanga)
vimal (vappachan) on 2018-07-10
description: updated
vimal (vappachan) on 2018-07-10
description: updated
Sundaresan Rajangam (srajanga) wrote :

Cassandra is started with -Xms1g and -Xmx2g. This is incorrect: both Xms and Xmx should be 8g.
The command line below contains -Xms8192M -Xmx8192M followed later by -Xms1g -Xmx2g.

cassand+ 1 0 99 04:57 ? 1-00:25:02 java -Xloggc:/var/log/cassandra/gc.log -ea -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -XX:+HeapDumpOnOutOfMemoryError -Xss256k -XX:StringTableSize=1000003 -XX:+AlwaysPreTouch -XX:-UseBiasedLocking -XX:+UseTLAB -XX:+ResizeTLAB -XX:+UseNUMA -XX:+PerfDisableSharedMem -Djava.net.preferIPv4Stack=true -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSWaitDuration=10000 -XX:+CMSParallelInitialMarkEnabled -XX:+CMSEdenChunksRecordAlways -XX:+CMSClassUnloadingEnabled -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintHeapAtGC -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime -XX:+PrintPromotionFailure -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=10M -Xms8192M -Xmx8192M -Xmn2048M -XX:CompileCommandFile=/etc/cassandra/hotspot_compiler -javaagent:/usr/share/cassandra/lib/jamm-0.3.0.jar -Dcassandra.jmx.local.port=7199 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.password.file=/etc/cassandra/jmxremote.password -Djava.library.path=/usr/share/cassandra/lib/sigar-bin -Xms1g -Xmx2g -Dcassandra.rpc_port=9160 -Dcassandra.native_transport_port=9042 -Dcassandra.ssl_storage_port=7011 -Dcassandra.storage_port=7010 -Dcassandra.jmx.local.port=7200 -Dcassandra.libjemalloc=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1 -XX:OnOutOfMemoryError=kill -9 %p -Dlogback.configurationFile=logback.xml -Dcassandra.logdir=/var/log/cassandra -Dcassandra.storagedir=/var/lib/cassandra -Dcassandra-foreground=yes -cp 
/etc/cassandra:/usr/share/cassandra/lib/HdrHistogram-2.1.9.jar:/usr/share/cassandra/lib/ST4-4.0.8.jar:/usr/share/cassandra/lib/airline-0.6.jar:/usr/share/cassandra/lib/antlr-runtime-3.5.2.jar:/usr/share/cassandra/lib/asm-5.0.4.jar:/usr/share/cassandra/lib/caffeine-2.2.6.jar:/usr/share/cassandra/lib/cassandra-driver-core-3.0.1-shaded.jar:/usr/share/cassandra/lib/commons-cli-1.1.jar:/usr/share/cassandra/lib/commons-codec-1.9.jar:/usr/share/cassandra/lib/commons-lang3-3.1.jar:/usr/share/cassandra/lib/commons-math3-3.2.jar:/usr/share/cassandra/lib/compress-lzf-0.8.4.jar:/usr/share/cassandra/lib/concurrent-trees-2.4.0.jar:/usr/share/cassandra/lib/concurrentlinkedhashmap-lru-1.4.jar:/usr/share/cassandra/lib/disruptor-3.0.1.jar:/usr/share/cassandra/lib/ecj-4.4.2.jar:/usr/share/cassandra/lib/guava-18.0.jar:/usr/share/cassandra/lib/high-scale-lib-1.0.6.jar:/usr/share/cassandra/lib/hppc-0.5.4.jar:/usr/share/cassandra/lib/jackson-core-asl-1.9.13.jar:/usr/share/cassandra/lib/jackson-mapper-asl-1.9.13.jar:/usr/share/cassandra/lib/jamm-0.3.0.jar:/usr/share/cassandra/lib/javax.inject.jar:/usr/share/cassandra/lib/jbcrypt-0.3m.jar:/usr/share/cassandra/lib/jcl-over-slf4j-1.7.7.jar:/usr/share/cassandra/lib/jctools-core-1.2.1.jar:/usr/share/cassandra/lib/jflex-1.6.0.jar:/usr/share/cassandra/lib/jna-4.2.2.jar:/usr/share/cassandra/lib/joda-time-2.4.jar:/usr/share/cassandra/lib/json-simple-1.1.jar:/usr/share/cassandra/lib/jsta...
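
The command line above carries two pairs of heap flags. A small sketch of how to read off the pair that takes effect, under the assumption that HotSpot honours the last occurrence of a repeated -Xms/-Xmx option (worth confirming on the actual JVM, for example with -XX:+PrintFlagsFinal). The function name effective_heap is hypothetical, and the sample string is a shortened excerpt of the flags shown above.

```python
import re

_UNITS = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
_HEAP_RE = re.compile(r"^-(Xms|Xmx)(\d+)([kKmMgG]?)$")


def effective_heap(cmdline):
    """Return {'Xms': bytes, 'Xmx': bytes} using the LAST occurrence of
    each flag (assumption: later flags override earlier ones)."""
    effective = {}
    for token in cmdline.split():
        match = _HEAP_RE.match(token)
        if match:
            flag, number, unit = match.groups()
            effective[flag] = int(number) * _UNITS[unit.lower()]
    return effective


sample_cmdline = (
    "java -Xms8192M -Xmx8192M -Xmn2048M "
    "-Djava.library.path=/usr/share/cassandra/lib/sigar-bin "
    "-Xms1g -Xmx2g -Dcassandra.rpc_port=9160"
)

heap = effective_heap(sample_cmdline)
for flag in ("Xms", "Xmx"):
    print(f"-{flag}: {heap[flag] / 1024 ** 3:.0f} GiB")
```

Under that assumption the trailing -Xms1g -Xmx2g wins over the earlier 8192M pair, which matches the observation that Cassandra is running with a 2g maximum heap.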


Sudheendra Rao (sudheendra-k) wrote :

The problem is not seen after removing JVM_EXTRA_OPT,
hence removing the sanityblocker tag.

tags: removed: sanityblocker
vimal (vappachan) wrote :

This issue is seen intermittently. JVM_EXTRA_OPT has been removed from instances.yaml. Below is the status with ocata-5.0-137. Logs are in /cs-shared/bugs/1780948/build137

[root@nodem14 contrail-ansible-deployer]# contrail-status
Pod Service Original Name State Status
analytics alarm-gen contrail-analytics-alarm-gen running Up 10 hours
analytics api contrail-analytics-api running Up 10 hours
analytics collector contrail-analytics-collector running Up 10 hours
analytics nodemgr contrail-nodemgr running Up 10 hours
analytics query-engine contrail-analytics-query-engine running Up 10 hours
analytics snmp-collector contrail-analytics-snmp-collector running Up 10 hours
analytics topology contrail-analytics-topology running Up 10 hours
config api contrail-controller-config-api running Up 7 hours
config device-manager contrail-controller-config-devicemgr running Up 10 hours
config nodemgr contrail-nodemgr running Up 10 hours
config schema contrail-controller-config-schema running Up 10 hours
config svc-monitor contrail-controller-config-svcmonitor running Up 10 hours
config-database cassandra contrail-external-cassandra running Up 10 hours
config-database nodemgr contrail-nodemgr restarting Restarting (0) 3 hours ago
config-database rabbitmq contrail-external-rabbitmq running Up 10 hours
config-database zookeeper contrail-external-zookeeper running Up 10 hours
control control contrail-controller-control-control running Up 7 hours
control dns contrail-controller-control-dns running Up 10 hours
control named contrail-controller-control-named running Up 10 hours
control nodemgr contrail-nodemgr running Up 10 hours
database cassandra contrail-external-cassandra running Up 10 hours
database kafka contrail-external-kafka running Up 10 hours
database nodemgr contrail-nodemgr running Up 10 hours
database zookeeper contrail-external-zookeeper running Up 10 hours
webui job contrail-controller-webui-job running Up 10 hours
webui web contrail-controller-webui-web running Up 10 hours

== Contrail control ==
control: active
nodemgr: active
named: active
dns: active

== Contrail config-database ==

== Contrail database ==
kafka: active
nodemgr: active
zookeeper: active
cassandra: active

== Contrail analytics ==
snmp-collector: active
query-engine: active
api: active
alarm-gen: active
nodemgr: active
collector: active
topology: active

== Contrail webui ...


tags: added: sanityblocker
Santosh Gupta (sangupta) wrote :

I see this in system.log on config-cassandra:

WARN [main] 2018-07-12 09:26:12,892 NativeLibrary.java:187 - Unable to lock JVM memory (ENOMEM). This can result in part of the JVM being swapped out, especially with mmapped I/O enabled. Increase RLIMIT_MEMLOCK or run Cassandra as root.

[root@nodec7 sangupta]# free -g
              total        used        free      shared  buff/cache   available
Mem:             31          25           0           0           4           5
Swap:             0           0           0

[root@nodec7 sangupta]# top -o %MEM

top - 00:12:05 up 10:23, 4 users, load average: 0.35, 0.61, 0.66
Tasks: 346 total, 1 running, 345 sleeping, 0 stopped, 0 zombie
%Cpu(s): 20.6 us, 2.9 sy, 0.0 ni, 76.5 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 32753116 total, 1031864 free, 26521596 used, 5199656 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 5721816 avail Mem

The VM is an all-in-one setup and looks under-resourced. Have you run all-in-one on this setup before?
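
As a rough check of the "under-resourced" reading, the memory line from top can be turned into a usage percentage. This sketch assumes the "KiB Mem :" line format shown above. The helper name parse_top_mem is hypothetical.

```python
import re


def parse_top_mem(line):
    """Parse a 'KiB Mem :' line from top into a dict of integers (KiB)."""
    fields = re.findall(r"(\d+)\s+(total|free|used|buff/cache)", line)
    return {name: int(value) for value, name in fields}


top_line = (
    "KiB Mem : 32753116 total, 1031864 free, "
    "26521596 used, 5199656 buff/cache"
)

mem = parse_top_mem(top_line)
used_pct = 100.0 * mem["used"] / mem["total"]
total_gib = mem["total"] / 1024 ** 2  # KiB -> GiB
print(f"used: {used_pct:.1f}% of {total_gib:.1f} GiB")
```

The percentage says how full the host is, not which process is responsible. The top -o %MEM process list is still needed for that.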

vimal (vappachan) wrote :

This issue is seen on 2 testbeds. Sanity had been running on both of them without any issues.

Santosh Gupta (sangupta) wrote :

Services look good in the container.
contrail-status always shows an error for config_database_cassandra_1/config_database_zookeeper_1.
Assigning to Andrey to check whether contrail-status needs a fix for the new config cassandra/zookeeper roles.

Andrey Pavlov (apavlov-e) wrote :

Please provide full information about the setup. I see that the containers are 5.0-137. Which version of ansible-deployer are you using? How much memory/CPU/disk does it have?

@Santosh, the new nodemgr is present in build 5.0-18? and above.

@Sudhee, @Vimal - without JVM_EXTRA_OPT this all-in-one VM can be overcommitted on memory. You can set this option at least for configdb.

Sudheendra Rao (sudheendra-k) wrote :

Removing sanityblocker as the problem is not seen in the recent build, but will monitor the bug for a few more builds before closing.

tags: added: sanity
removed: sanityblocker
Sudheendra Rao (sudheendra-k) wrote :

The problem was due to a partial commit for bug 1765487. It is no longer seen now that bug 1765487 is fixed, hence closing this bug.
