Vcenter-only: upgrade from 2.21.2 to 3.1.0.0 (build#14) failed Analytics/collector

Bug #1612545 reported by Sarath
This bug affects 1 person
Affects            Status         Importance  Assigned to  Milestone
Juniper Openstack  (status tracked in Trunk)
  R3.1             Fix Committed  High        Ranjeet R
  Trunk            Fix Committed  High        Ranjeet R

Bug Description

Upgrade from 2.21.2 to 3.1.0.0 (build#14) failed for Analytics/collector. This is a setup with 3 controller nodes; the failure is seen on all 3 nodes, and the services do not come up after a service restart.

root@oblocknode02:~#
root@oblocknode02:~# contrail-status
ssh root@172.16.80.105
== Contrail Control ==
supervisor-control: active
contrail-control active
contrail-control-nodemgr active
contrail-dns active
contrail-named active

== Contrail Analytics ==
supervisor-analytics: active
contrail-alarm-gen active
contrail-analytics-api initializing (UvePartitions:UVE-Aggregation[Partitions:0] connection down)
contrail-analytics-nodemgr active
contrail-collector initializing (KafkaPub:172.16.80.2:9092,172.16.80.13:9092,172.16.80.4:9092 connection down)
contrail-query-engine active
contrail-snmp-collector active
contrail-topology active

== Contrail Config ==
supervisor-config: active
contrail-api:0 active
contrail-config-nodemgr active
contrail-device-manager backup
contrail-discovery:0 active
contrail-schema backup
contrail-svc-monitor backup
contrail-vcenter-plugin active
ifmap active

== Contrail Web UI ==
supervisor-webui: active
contrail-webui active
contrail-webui-middleware active

== Contrail Database ==
contrail-database: active

== Contrail Supervisor Database ==
supervisor-database: active
contrail-database-nodemgr active
kafka active

== Contrail Support Services ==
supervisor-support-service: active
rabbitmq-server active

========Run time service failures=============
/var/crashes/core.contrail-collec.21014.oblocknode02.1470962323
/var/crashes/core.contrail-collec.1525.oblocknode02.1470958245

[New LWP 1525]
[New LWP 1622]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/bin/contrail-collector'.
Program terminated with signal SIGABRT, Aborted.
#0 0x00007f4a849d2cc9 in raise () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0 0x00007f4a849d2cc9 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f4a849d60d8 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007f4a849cbb86 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3 0x00007f4a849cbc32 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4 0x0000000000601a68 in ?? ()
#5 0x00000000005a7ab1 in ?? ()
#6 0x000000000059611d in ?? ()
#7 0x0000000000744120 in ?? ()
#8 0x0000000000742eae in ?? ()
#9 0x0000000000743615 in ?? ()
#10 0x00000000007411ab in ?? ()
#11 0x0000000000739085 in ?? ()
#12 0x00000000007402a7 in ?? ()
#13 0x000000000046a2bf in ?? ()
#14 0x00007f4a85f5ab3a in ?? () from /usr/lib/libtbb.so.2
#15 0x00007f4a85f56816 in ?? () from /usr/lib/libtbb.so.2
#16 0x00007f4a85f55f4b in ?? () from /usr/lib/libtbb.so.2
#17 0x00007f4a85f520ff in ?? () from /usr/lib/libtbb.so.2
#18 0x00007f4a85f522f9 in ?? () from /usr/lib/libtbb.so.2
#19 0x00007f4a86176182 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#20 0x00007f4a84a9647d in clone () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) exit

Sarath (nsarath) wrote :

nsarath@ubuntu-build04:/cs-shared/cores/1612545$ pwd
/cs-shared/cores/1612545
nsarath@ubuntu-build04:/cs-shared/cores/1612545$ ls -l
total 2979776
-rwxrwxrwx 1 nsarath test 540508160 Aug 12 01:57 core.contrail-collec.1525.oblocknode02.1470958245
-rwxrwxrwx 1 nsarath test 444080128 Aug 12 01:56 core.contrail-collec.21014.oblocknode02.1470962323
-rwxrwxrwx 1 nsarath test 770426880 Aug 12 01:49 Ctrl-A-log.tar
-rwxrwxrwx 1 nsarath test 644382720 Aug 12 01:49 Ctrl-B-log.tar
-rwxrwxrwx 1 nsarath test 548812800 Aug 12 01:49 Ctrl-C-log.tar
-rwxrwxrwx 1 nsarath test 18186240 Aug 12 01:49 Vrtr-0-log.tar
-rwxrwxrwx 1 nsarath test 18206720 Aug 12 01:52 Vrtr-1-log.tar
-rwxrwxrwx 1 nsarath test 18216960 Aug 12 01:53 Vrtr-3-log.tar
-rwxrwxrwx 1 nsarath test 18237440 Aug 12 01:53 Vrtr-5-log.tar
-rwxrwxrwx 1 nsarath test 18206720 Aug 12 01:53 Vrtr-7-log.tar
nsarath@ubuntu-build04:/cs-shared/cores/1612545$

description: updated
Raj Reddy (rajreddy) wrote :

+Ignatious

This may be true in other places as well. Most often, whatever we do for setup we also have to do for upgrade.

-Raj

On Aug 12, 2016, at 8:30 AM, Raj Reddy <email address hidden> wrote:

Hi Nikhil,

Thanks for the analysis..

I guess the statement should have been:

    if (parent_cmd == "setup-vnc-database" or parent_cmd == "upgrade-vnc-database") and get_kafka_enabled() is not None:
        cmd += " --kafka_broker_id %d" % broker_id

Ranjeet, can you please fix it..

thanks,
-Raj

On Aug 12, 2016, at 4:05 AM, Nikhil Bansal <email address hidden> wrote:

I looked into the code and it seems that broker_id is not being passed as an argument due to a change in fabfile/utils/commandline.py:
https://review.opencontrail.org/#/c/18040/

Now we pass broker id only in this case:
    if parent_cmd == "setup-vnc-database" and get_kafka_enabled() is not None:
        cmd += " --kafka_broker_id %d" % broker_id

In the upgrade case, we do not pass the kafka broker id. I am not sure about the above-mentioned change, so maybe Raj can take it forward now. I am also including the committer of this code to get a better understanding.

Thanks,
Nikhil
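
The condition discussed above can be sketched in isolation. This is a minimal illustration, not the actual code in fabfile/utils/commandline.py: the function name `append_broker_id`, the `kafka_enabled` parameter (a stand-in for the `get_kafka_enabled() is not None` check), and the command "some-other-command" are assumptions made for the example.

```python
# Hypothetical, simplified stand-in for the command-building logic discussed
# in this thread. Only parent_cmd, broker_id and --kafka_broker_id come from
# the bug report; every other name here is an assumption for illustration.

# Parent commands that re-provision the kafka config and so need a broker id.
BROKER_ID_COMMANDS = ("setup-vnc-database", "upgrade-vnc-database")


def append_broker_id(cmd, parent_cmd, broker_id, kafka_enabled=True):
    """Return cmd with --kafka_broker_id appended for setup and upgrade."""
    if parent_cmd in BROKER_ID_COMMANDS and kafka_enabled:
        cmd += " --kafka_broker_id %d" % broker_id
    return cmd


if __name__ == "__main__":
    # Each of the three database nodes should get a distinct broker id.
    for broker_id in range(3):
        print(append_broker_id("upgrade-vnc-database",
                               "upgrade-vnc-database", broker_id))
```

With the original condition, which matched only "setup-vnc-database", the upgrade command would come back unchanged, so no broker id would be passed during upgrade.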

From: Nikhil Bansal <email address hidden>
Date: Friday, August 12, 2016 at 2:50 PM
To: Sarathbabu Narasimhan <email address hidden>, Raj Reddy <email address hidden>
Subject: Re: Bug #1612545 : Vcenter-only: upgrade from 2.21.2 to 3.1.0.0 (build#14) failed Analytics/collector

It seems that the kafka config is out of sync. Every node has a brokerid of 0, which is causing kafka to go down:

java.lang.RuntimeException: A broker is already registered on the path /brokers/ids/0. This probably indicates that you either have configured a brokerid that is already in use, or else you have shutdown this broker and restarted it faster than the zookeeper timeout so it appears to be re-registering.

The corresponding config files have a brokerid of 0 on all the nodes. I am not very familiar with the details of the kafka upgrade, so I will be looking into the code to figure out the possible reason.

Thanks,
Nikhil
PS: We can recover from it by deleting the zookeeper ephemeral node.
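
A duplicate broker id like the one described above can be spotted by comparing the kafka config from each node. The following is a rough sketch under assumptions: it assumes the id is stored as the `broker.id` key in Kafka's server.properties format, and the node names and config contents are made up for illustration rather than taken from this setup.

```python
from collections import defaultdict


def parse_broker_id(properties_text):
    """Return the integer broker.id found in server.properties text, or None."""
    for line in properties_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if sep and key.strip() == "broker.id":
            return int(value.strip())
    return None


def find_duplicate_broker_ids(configs):
    """Map each broker id used by more than one node to the sorted node names."""
    nodes_by_id = defaultdict(list)
    for node, text in configs.items():
        nodes_by_id[parse_broker_id(text)].append(node)
    return {bid: sorted(nodes)
            for bid, nodes in nodes_by_id.items() if len(nodes) > 1}


if __name__ == "__main__":
    # Hypothetical config contents, one entry per controller node.
    configs = {
        "node-a": "broker.id=0\n",
        "node-b": "broker.id=0\n",
        "node-c": "broker.id=0\n",
    }
    print(find_duplicate_broker_ids(configs))
```

An empty result means every node has its own id; any entry in the result names the nodes that would collide on the same /brokers/ids/<id> path in zookeeper.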

From: Sarathbabu Narasimhan <email address hidden>
Date: Friday, August 12, 2016 at 2:16 PM
To: Raj Reddy <email address hidden>, Nikhil Bansal <email address hidden>
Cc: Sarathbabu Narasimhan <email address hidden>
Subject: Bug #1612545 : Vcenter-only: upgrade from 2.21.2 to 3.1.0.0 (build#14) failed Analytics/collector

Hi Raj/Nikhil,

Bug #1612545 : Vcenter-only: upgrade from 2.21.2 to 3.1.0.0 (build#14) failed Analytics/collector

As this issue is not recovering in spite of a service restart, and Ashish mentioned we should support the 2.21 upgrade, I kept the setup in the problem state for triaging. Please find the nodes below:
10.87.26.197 / .208 / .199

Thanks
*Sarath

Changed in juniperopenstack:
assignee: Raj Reddy (rajreddy) → Ranjeet R (rranjeet-n)
Ignatious Johnson Christopher (ijohnson-x) wrote :

Raj/Nikhil,

We don’t re-provision config files in other roles during upgrade; we do that only in analytics/database.
In other roles, we add/remove the changed config bits in the config files.

So Ranjeet would have assumed that the analytics upgrade is the same as for other roles.

We need to audit the analytics/database upgrade.py and remove unnecessary re-provisioning of config files during upgrade.

Thanks,
Ignatious

OpenContrail Admin (ci-admin-f) wrote : [Review update] R3.1

Review in progress for https://review.opencontrail.org/23258
Submitter: Ignatious Johnson Christopher (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/23258
Committed: http://github.org/Juniper/contrail-fabric-utils/commit/9038c00dd0ff4b942596c4b2fb5c263bce7ec2f5
Submitter: Zuul
Branch: R3.1

commit 9038c00dd0ff4b942596c4b2fb5c263bce7ec2f5
Author: Ignatious Johnson Christopher <email address hidden>
Date: Fri Aug 12 10:58:10 2016 -0700

Pass kafka broker id in upgrade commandline aswell,
as database upgrade reprovision's cassandra/kafka config files.

Change-Id: Ie44bde1dc75341a82f93b8bafe108b6e15acc248
Closes-Bug: 1612545

OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/23290
Submitter: Ignatious Johnson Christopher (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/23290
Committed: http://github.org/Juniper/contrail-fabric-utils/commit/2f2671527e0a3f6ca08fcdd1c410f1aff012441c
Submitter: Zuul
Branch: master

commit 2f2671527e0a3f6ca08fcdd1c410f1aff012441c
Author: Ignatious Johnson Christopher <email address hidden>
Date: Fri Aug 12 10:58:10 2016 -0700

Pass kafka broker id in upgrade commandline aswell,
as database upgrade reprovision's cassandra/kafka config files.

Change-Id: Ie44bde1dc75341a82f93b8bafe108b6e15acc248
Closes-Bug: 1612545
(cherry picked from commit 9038c00dd0ff4b942596c4b2fb5c263bce7ec2f5)
