Upgrade from 2.21.2b36 to 3.1.1.0-45 sometimes fails due to upgrade-vnc-database.

Bug #1645250 reported by Sandeep Sridhar
This bug affects 1 person
Affects            Status  Importance  Assigned to  Milestone
Juniper Openstack  New     Medium      Unassigned
R3.1               New     Medium      Unassigned
R3.2               New     Medium      Unassigned

Bug Description

This is an intermittent issue. Upgrade from 2.21.2-Build36 to 3.1.1.0-Build45 sometimes fails while executing 'nodetool upgradesstables'.

fab upgrade_contrail fails with the following error messages.

--------------------------------------------------------------------
2016-11-28 12:51:22:799954: [root@10.0.0.201] out: nodetool: Failed to connect to '127.0.0.1:7199' - ConnectException: 'Connection refused'.
2016-11-28 12:51:23:239768: [root@10.0.0.201] out:
2016-11-28 12:51:23:255387: [root@10.0.0.201] out: Fatal error: local() encountered an error (return code 1) while executing 'nodetool upgradesstables'
2016-11-28 12:51:23:255470: [root@10.0.0.201] out:
2016-11-28 12:51:23:255546: [root@10.0.0.201] out: Aborting.
2016-11-28 12:51:23:255610: [root@10.0.0.201] out:
2016-11-28 12:51:23:319930:

2016-11-28 12:51:23:323949: Fatal error: sudo() received nonzero return code 1 while executing!
2016-11-28 12:51:23:323949:
2016-11-28 12:51:23:323949: Requested: upgrade-vnc-database --self_ip 192.168.0.201 --cfgm_ip 192.168.0.200 --opscenter_ip 10.0.0.201 --seed_list 192.168.0.201,192.168.0.202 --zookeeper_ip_list 192.168.0.201 192.168.0.202 192.168.0.203 --database_index 1 --minimum_diskGB 256 --kafka_broker_id 0 -P contrail-openstack-database -F 2.21.2 -T 3.1.1.0
2016-11-28 12:51:23:323949: Executed: sudo -S -p 'sudo password:' /bin/bash -l -c "upgrade-vnc-database --self_ip 192.168.0.201 --cfgm_ip 192.168.0.200 --opscenter_ip 10.0.0.201 --seed_list 192.168.0.201,192.168.0.202 --zookeeper_ip_list 192.168.0.201 192.168.0.202 192.168.0.203 --database_index 1 --minimum_diskGB 256 --kafka_broker_id 0 -P contrail-openstack-database -F 2.21.2 -T 3.1.1.0"
2016-11-28 12:51:23:323949:
--------------------------------------------------------------------

After the failure, we checked the status of cassandra on the failed node (sv-1). The process was running and could be accessed with cqlsh. However, it was not bound to the JMX port (7199).

--------------------------------------------------------------------
root@sv-1:~# ps -ef | grep cassandra
cassand+ 11311 1 25 13:18 ? 00:00:23 java -ea -javaagent:/usr/share/cassandra/lib/jamm-0.3.0.jar -XX:+CMSClassUnloadingEnabled -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms1996M -Xmx1996M -Xmn400M -XX:+HeapDumpOnOutOfMemoryError -Xss256k -XX:StringTableSize=1000003 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseTLAB -XX:+PerfDisableSharedMem -XX:CompileCommandFile=/etc/cassandra/hotspot_compiler -XX:CMSWaitDuration=10000 -XX:+CMSParallelInitialMarkEnabled -XX:+CMSEdenChunksRecordAlways -XX:CMSWaitDuration=10000 -XX:+UseCondCardMark -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintHeapAtGC -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime -XX:+PrintPromotionFailure -Xloggc:/var/log/cassandra/gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=10M -Djava.net.preferIPv4Stack=true -Dcassandra.jmx.local.port=7199 -XX:+DisableExplicitGC -Djava.library.path=/usr/share/cassandra/lib/sigar-bin -Dcassandra.libjemalloc=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1 -Dlogback.configurationFile=logback.xml -Dcassandra.logdir=/var/log/cassandra -Dcassandra.storagedir= -Dcassandra-pidfile=/var/run/cassandra/cassandra.pid -cp 
/etc/cassandra:/usr/share/cassandra/lib/ST4-4.0.8.jar:/usr/share/cassandra/lib/airline-0.6.jar:/usr/share/cassandra/lib/antlr-runtime-3.5.2.jar:/usr/share/cassandra/lib/cassandra-driver-core-2.2.0-rc2-SNAPSHOT-20150617-shaded.jar:/usr/share/cassandra/lib/commons-cli-1.1.jar:/usr/share/cassandra/lib/commons-codec-1.2.jar:/usr/share/cassandra/lib/commons-lang3-3.1.jar:/usr/share/cassandra/lib/commons-math3-3.2.jar:/usr/share/cassandra/lib/compress-lzf-0.8.4.jar:/usr/share/cassandra/lib/concurrentlinkedhashmap-lru-1.4.jar:/usr/share/cassandra/lib/crc32ex-0.1.1.jar:/usr/share/cassandra/lib/disruptor-3.0.1.jar:/usr/share/cassandra/lib/ecj-4.4.2.jar:/usr/share/cassandra/lib/guava-16.0.jar:/usr/share/cassandra/lib/high-scale-lib-1.0.6.jar:/usr/share/cassandra/lib/jackson-core-asl-1.9.2.jar:/usr/share/cassandra/lib/jackson-mapper-asl-1.9.2.jar:/usr/share/cassandra/lib/jamm-0.3.0.jar:/usr/share/cassandra/lib/javax.inject.jar:/usr/share/cassandra/lib/jbcrypt-0.3m.jar:/usr/share/cassandra/lib/jcl-over-slf4j-1.7.7.jar:/usr/share/cassandra/lib/jna-4.0.0.jar:/usr/share/cassandra/lib/joda-time-2.4.jar:/usr/share/cassandra/lib/json-simple-1.1.jar:/usr/share/cassandra/lib/libthrift-0.9.2.jar:/usr/share/cassandra/lib/log4j-over-slf4j-1.7.7.jar:/usr/share/cassandra/lib/logback-classic-1.1.3.jar:/usr/share/cassandra/lib/logback-core-1.1.3.jar:/usr/share/cassandra/lib/lz4-1.3.0.jar:/usr/share/cassandra/lib/metrics-core-3.1.0.jar:/usr/share/cassandra/lib/metrics-logback-3.1.0.jar:/usr/share/cassandra/lib/netty-all-4.0.23.Final.jar:/usr/share/cassandra/lib/ohc-core-0.3.4.jar:/usr/share/cassandra/lib/ohc-core-j8-0.3.4.jar:/usr/share/cassandra/lib/reporter-config-base-3.0.0.jar:/usr/share/cassandra/lib/reporter-config3-3.0.0.jar:/usr/share/cassandra/lib/sigar-1.6.4.jar:/usr/share/cassandra/lib/slf4j-api-1.7.7.jar:/usr/share/cassandra/lib/snakeyaml-1.11.jar:/usr/share/cassandra/lib/snappy-java-1.1.1.7.jar:/usr/share/cassandra/lib/stream-2.5.2.jar:/usr/share/cassandra/lib/super-csv-2.1.0.jar:
/usr/share/cassandra/lib/thrift-server-0.3.7.jar:/usr/share/cassandra/apache-cassandra-2.2.5.jar:/usr/share/cassandra/apache-cassandra-thrift-2.2.5.jar:/usr/share/cassandra/apache-cassandra.jar:/usr/share/cassandra/stress.jar: -XX:HeapDumpPath=/var/lib/cassandra/java_1480306721.hprof -XX:ErrorFile=/var/lib/cassandra/hs_err_1480306721.log org.apache.cassandra.service.CassandraDaemon
root 13398 13240 0 13:20 pts/0 00:00:00 grep --color=auto cassandra

root@sv-1:~# netstat -natp ~ | grep 7199
root@sv-1:~#

root@sv-1:~# nodetool status
nodetool: Failed to connect to '127.0.0.1:7199' - ConnectException: 'Connection refused'.

root@sv-1:~#
root@sv-1:~# cqlsh 192.168.0.201 -e exit
Connected to Contrail at 192.168.0.201:9042.
[cqlsh 5.0.1 | Cassandra 2.2.5 | CQL spec 3.3.1 | Native protocol v4]
Use HELP for help.
cqlsh> exit

root@sv-1:~#

The customer believes the cause is the frequent starting and stopping of the cassandra service, combined with insufficient status checks in upgrade-vnc-database.

https://github.com/Juniper/contrail-provisioning/blob/master/contrail_provisioning/database/base.py#L176
https://github.com/Juniper/contrail-provisioning/blob/master/contrail_provisioning/database/migrate.py#L52

When migrate.py is modified to retry stopping the cassandra service and to verify that the JMX port is bound before running nodetool upgradesstables, the issue is no longer seen.
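
The JMX port check mentioned above can be sketched with the Python standard library alone. This is a minimal illustration, not the change that was made in migrate.py: `wait_for_port` and its parameters are hypothetical names, and the defaults (127.0.0.1, port 7199) are taken from the nodetool error message in this report.

```python
import socket
import time


def wait_for_port(host="127.0.0.1", port=7199, timeout=120.0, interval=5.0):
    """Return True once a TCP connect to host:port succeeds, False on timeout.

    Hypothetical helper: the name and defaults are assumptions for this sketch.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means some process is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: the port is not ready yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A caller would run `wait_for_port()` before 'nodetool upgradesstables' and abort with a clear message if it returns False, instead of letting nodetool fail with 'Connection refused'.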

Please add some more validation in migrate.py to mitigate this.

Tags: analytics
information type: Proprietary → Public
Raj Reddy (rajreddy) wrote :

"When the code is modified in migrate.py..."

Can you attach the diffs used for this? We will evaluate them for correctness and commit the code.

Jeba Paulaiyan (jebap)
Changed in juniperopenstack:
importance: Undecided → Medium
Sandeep Sridhar (ssandeep) wrote :

Hi Raj, these are the changes the customer made to migrate.py:

-------------------------------------------------------------------------------------------
root@sv-1:~# diff -up /usr/local/lib/python2.7/dist-packages/contrail_provisioning/database/migrate.py.org /usr/local/lib/python2.7/dist-packages/contrail_provisioning/database/migrate.py
--- /usr/local/lib/python2.7/dist-packages/contrail_provisioning/database/migrate.py.org 2016-11-29 15:54:43.733510099 +0900
+++ /usr/local/lib/python2.7/dist-packages/contrail_provisioning/database/migrate.py 2016-11-29 15:54:44.825510017 +0900
@@ -50,7 +50,12 @@ class DatabaseMigrate(DatabaseCommon):
         self._args = parser.parse_args(self.remaining_argv)

     def stop_cassandra(self):
-        local('service cassandra stop')
+        with settings(warn_only=True):
+            while True:
+                local("service cassandra stop; sleep 5")
+                result = local("ps -ef | grep cassandr[a]")
+                if result.failed:
+                    break

     def force_stop_cassandra(self):
         local('kill `ps auxw | grep -E "Dcassandra-pidfile=.*cassandra\.pid" | grep -v grep | awk \'{print $2}\'`')
@@ -59,6 +64,13 @@ class DatabaseMigrate(DatabaseCommon):
         local('service contrail-database stop')

     def upgrade_sstables_and_drain(self):
+        with settings(warn_only=True):
+            while True:
+                result = local("nodetool status > /dev/null")
+                if result.succeeded:
+                    break
+                else:
+                    local("sleep 5")
         print 'Upgrading database sstables...'
         local('nodetool upgradesstables')
         local('nodetool drain')
-------------------------------------------------------------------------------------------
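
Both loops in the diff above retry without a limit, so the upgrade would hang if cassandra never stops or JMX never comes up. A bounded variant could look like the sketch below. It is an illustration only, not the customer's patch or the committed fix: `retry_until`, `command_succeeds`, and `jmx_ready` are hypothetical names, and plain `subprocess` is used here in place of Fabric's `local()`.

```python
import subprocess
import time


def retry_until(check, attempts=24, delay=5.0):
    """Call check() up to `attempts` times; return True as soon as it returns True."""
    for attempt in range(attempts):
        if check():
            return True
        # Sleep between attempts, but not after the last one.
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


def command_succeeds(argv):
    """Return True if the command can be started and exits with status 0."""
    try:
        return subprocess.call(
            argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        ) == 0
    except OSError:
        # The executable is missing or cannot be run.
        return False


def jmx_ready():
    # Same readiness test as the diff: treat a zero exit status from
    # 'nodetool status' as "JMX is reachable".
    return command_succeeds(["nodetool", "status"])
```

With these helpers, `upgrade_sstables_and_drain` could call `retry_until(jmx_ready)` and raise an error when it returns False, which gives up after about two minutes with the default values instead of looping indefinitely.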

Raj Reddy (rajreddy) wrote :

We added a 'nodetool status' check when verifying that cassandra is up, as part of
https://launchpad.net/bugs/1646945
