This is an intermittent issue. The upgrade from 2.21.2-Build36 to 3.1.1.0-Build45 sometimes fails while executing 'nodetool upgradesstables'.
fab upgrade_contrail fails with the following error messages.
--------------------------------------------------------------------
2016-11-28 12:51:22:799954: [root@10.0.0.201] out: nodetool: Failed to connect to '127.0.0.1:7199' - ConnectException: 'Connection refused'.
2016-11-28 12:51:23:239768: [root@10.0.0.201] out:
2016-11-28 12:51:23:255387: [root@10.0.0.201] out: Fatal error: local() encountered an error (return code 1) while executing 'nodetool upgradesstables'
2016-11-28 12:51:23:255470: [root@10.0.0.201] out:
2016-11-28 12:51:23:255546: [root@10.0.0.201] out: Aborting.
2016-11-28 12:51:23:255610: [root@10.0.0.201] out:
2016-11-28 12:51:23:319930:
2016-11-28 12:51:23:323949: Fatal error: sudo() received nonzero return code 1 while executing!
2016-11-28 12:51:23:323949:
2016-11-28 12:51:23:323949: Requested: upgrade-vnc-database --self_ip 192.168.0.201 --cfgm_ip 192.168.0.200 --opscenter_ip 10.0.0.201 --seed_list 192.168.0.201,192.168.0.202 --zookeeper_ip_list 192.168.0.201 192.168.0.202 192.168.0.203 --database_index 1 --minimum_diskGB 256 --kafka_broker_id 0 -P contrail-openstack-database -F 2.21.2 -T 3.1.1.0
2016-11-28 12:51:23:323949: Executed: sudo -S -p 'sudo password:' /bin/bash -l -c "upgrade-vnc-database --self_ip 192.168.0.201 --cfgm_ip 192.168.0.200 --opscenter_ip 10.0.0.201 --seed_list 192.168.0.201,192.168.0.202 --zookeeper_ip_list 192.168.0.201 192.168.0.202 192.168.0.203 --database_index 1 --minimum_diskGB 256 --kafka_broker_id 0 -P contrail-openstack-database -F 2.21.2 -T 3.1.1.0"
2016-11-28 12:51:23:323949:
--------------------------------------------------------------------
After the failure, we checked the status of cassandra on the failed node (sv-1). The process was running and could be accessed with cqlsh. However, it was not listening on the JMX port (7199).
--------------------------------------------------------------------
root@sv-1:~# ps -ef | grep cassandra
cassand+ 11311 1 25 13:18 ? 00:00:23 java -ea -javaagent:/usr/share/cassandra/lib/jamm-0.3.0.jar -XX:+CMSClassUnloadingEnabled -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms1996M -Xmx1996M -Xmn400M -XX:+HeapDumpOnOutOfMemoryError -Xss256k -XX:StringTableSize=1000003 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseTLAB -XX:+PerfDisableSharedMem -XX:CompileCommandFile=/etc/cassandra/hotspot_compiler -XX:CMSWaitDuration=10000 -XX:+CMSParallelInitialMarkEnabled -XX:+CMSEdenChunksRecordAlways -XX:CMSWaitDuration=10000 -XX:+UseCondCardMark -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintHeapAtGC -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime -XX:+PrintPromotionFailure -Xloggc:/var/log/cassandra/gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=10M -Djava.net.preferIPv4Stack=true -Dcassandra.jmx.local.port=7199 -XX:+DisableExplicitGC -Djava.library.path=/usr/share/cassandra/lib/sigar-bin -Dcassandra.libjemalloc=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1 -Dlogback.configurationFile=logback.xml -Dcassandra.logdir=/var/log/cassandra -Dcassandra.storagedir= -Dcassandra-pidfile=/var/run/cassandra/cassandra.pid -cp 
/etc/cassandra:/usr/share/cassandra/lib/ST4-4.0.8.jar:/usr/share/cassandra/lib/airline-0.6.jar:/usr/share/cassandra/lib/antlr-runtime-3.5.2.jar:/usr/share/cassandra/lib/cassandra-driver-core-2.2.0-rc2-SNAPSHOT-20150617-shaded.jar:/usr/share/cassandra/lib/commons-cli-1.1.jar:/usr/share/cassandra/lib/commons-codec-1.2.jar:/usr/share/cassandra/lib/commons-lang3-3.1.jar:/usr/share/cassandra/lib/commons-math3-3.2.jar:/usr/share/cassandra/lib/compress-lzf-0.8.4.jar:/usr/share/cassandra/lib/concurrentlinkedhashmap-lru-1.4.jar:/usr/share/cassandra/lib/crc32ex-0.1.1.jar:/usr/share/cassandra/lib/disruptor-3.0.1.jar:/usr/share/cassandra/lib/ecj-4.4.2.jar:/usr/share/cassandra/lib/guava-16.0.jar:/usr/share/cassandra/lib/high-scale-lib-1.0.6.jar:/usr/share/cassandra/lib/jackson-core-asl-1.9.2.jar:/usr/share/cassandra/lib/jackson-mapper-asl-1.9.2.jar:/usr/share/cassandra/lib/jamm-0.3.0.jar:/usr/share/cassandra/lib/javax.inject.jar:/usr/share/cassandra/lib/jbcrypt-0.3m.jar:/usr/share/cassandra/lib/jcl-over-slf4j-1.7.7.jar:/usr/share/cassandra/lib/jna-4.0.0.jar:/usr/share/cassandra/lib/joda-time-2.4.jar:/usr/share/cassandra/lib/json-simple-1.1.jar:/usr/share/cassandra/lib/libthrift-0.9.2.jar:/usr/share/cassandra/lib/log4j-over-slf4j-1.7.7.jar:/usr/share/cassandra/lib/logback-classic-1.1.3.jar:/usr/share/cassandra/lib/logback-core-1.1.3.jar:/usr/share/cassandra/lib/lz4-1.3.0.jar:/usr/share/cassandra/lib/metrics-core-3.1.0.jar:/usr/share/cassandra/lib/metrics-logback-3.1.0.jar:/usr/share/cassandra/lib/netty-all-4.0.23.Final.jar:/usr/share/cassandra/lib/ohc-core-0.3.4.jar:/usr/share/cassandra/lib/ohc-core-j8-0.3.4.jar:/usr/share/cassandra/lib/reporter-config-base-3.0.0.jar:/usr/share/cassandra/lib/reporter-config3-3.0.0.jar:/usr/share/cassandra/lib/sigar-1.6.4.jar:/usr/share/cassandra/lib/slf4j-api-1.7.7.jar:/usr/share/cassandra/lib/snakeyaml-1.11.jar:/usr/share/cassandra/lib/snappy-java-1.1.1.7.jar:/usr/share/cassandra/lib/stream-2.5.2.jar:/usr/share/cassandra/lib/super-csv-2.1.0.jar:
/usr/share/cassandra/lib/thrift-server-0.3.7.jar:/usr/share/cassandra/apache-cassandra-2.2.5.jar:/usr/share/cassandra/apache-cassandra-thrift-2.2.5.jar:/usr/share/cassandra/apache-cassandra.jar:/usr/share/cassandra/stress.jar: -XX:HeapDumpPath=/var/lib/cassandra/java_1480306721.hprof -XX:ErrorFile=/var/lib/cassandra/hs_err_1480306721.log org.apache.cassandra.service.CassandraDaemon
root 13398 13240 0 13:20 pts/0 00:00:00 grep --color=auto cassandra
root@sv-1:~# netstat -natp ~ | grep 7199
root@sv-1:~#
root@sv-1:~# nodetool status
nodetool: Failed to connect to '127.0.0.1:7199' - ConnectException: 'Connection refused'.
root@sv-1:~#
root@sv-1:~# cqlsh 192.168.0.201 -e exit
Connected to Contrail at 192.168.0.201:9042.
[cqlsh 5.0.1 | Cassandra 2.2.5 | CQL spec 3.3.1 | Native protocol v4]
Use HELP for help.
cqlsh> exit
root@sv-1:~#
The customer thinks it is due to frequent start/stop of the cassandra service and insufficient status confirmation in upgrade-vnc-database.
https://github.com/Juniper/contrail-provisioning/blob/master/contrail_provisioning/database/base.py#L176
https://github.com/Juniper/contrail-provisioning/blob/master/contrail_provisioning/database/migrate.py#L52
When the code is modified in migrate.py to stop cassandra service with retries and check the binding of JMX port before running nodetool upgradesstables, this issue is not seen.
Please add some more validation in migrate.py to mitigate this.
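For illustration only, the JMX port check described above could look roughly like the sketch below. This is not the customer's actual change and not existing contrail-provisioning code: the function names wait_for_port and upgrade_sstables, the 120 second timeout, and the 2 second polling interval are all assumptions, and the sketch assumes Python 3. It only polls 127.0.0.1:7199 (the address and port from the nodetool error above) and runs nodetool once a TCP connection succeeds. The retry logic for stopping the cassandra service is not covered here.

```python
import socket
import subprocess
import time


def wait_for_port(host, port, timeout=120.0, interval=2.0):
    """Return True once a TCP connect to host:port succeeds, False on timeout.

    Hypothetical helper; timeout and interval values are assumptions.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: not listening yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def upgrade_sstables(jmx_host="127.0.0.1", jmx_port=7199, timeout=120.0):
    """Run 'nodetool upgradesstables' only after the JMX port accepts connections."""
    if not wait_for_port(jmx_host, jmx_port, timeout=timeout):
        raise RuntimeError(
            "Cassandra JMX port %s:%d not listening after %.0f s"
            % (jmx_host, jmx_port, timeout)
        )
    # check_call raises CalledProcessError on a nonzero exit code.
    subprocess.check_call(["nodetool", "upgradesstables"])
```

Failing with an explicit error when the port never opens would at least distinguish "cassandra came up without JMX" from a genuine upgradesstables failure in the fab output.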
"When the code is modified in migrate.py..."
Can you attach the diffs used for this? We will evaluate them for correctness and commit the code.