Bug #1706832 “3.1.3-75:Potential bug in restore_cassandra_db” : Bugs : Juniper Openstack

Sandeep Sridhar (ssandeep) on 2017-07-27

information type:	Proprietary → Public
Changed in juniperopenstack:
importance:	Undecided → High
assignee:	nobody → Megh Bhatt (meghb)
milestone:	none → r3.1.4.0

Jeba Paulaiyan (jebap) on 2017-07-27

tags:	added: analytics

Sandeep Sridhar (ssandeep) wrote on 2017-07-31:

#1

Analytics Team - We need to speed up the progress on this. Can I please get an update?

-Sandeep.

Megh Bhatt (meghb) wrote on 2017-08-03:

#2

Looked at the customer setup:

The issue seems to be that the "service supervisor-analytics stop" command issued by fab stop_collector hangs. "service supervisor-analytics stop" in turn calls "supervisorctl -s unix:///var/run/supervisord_analytics.sock stop all", and that command hangs. The reason appears to be that once fab stop_database is run, contrail-alarm-gen loses its connection to kafka and then writes a large number of messages to stdout:

FailedPayloadsError for -uve-19:0
Connect attempt to <BrokerConnection host=192.168.0.138 port=9092> returned error 111. Disconnecting.
Skipping unconnected connection: <BrokerConnection host=192.168.0.138 port=9092>
Connect attempt to <BrokerConnection host=192.168.0.137 port=9092> returned error 111. Disconnecting.
Skipping unconnected connection: <BrokerConnection host=192.168.0.137 port=9092>
Connect attempt to <BrokerConnection host=192.168.0.139 port=9092> returned error 111. Disconnecting.
Skipping unconnected connection: <BrokerConnection host=192.168.0.139 port=9092>
FailedPayloadsError for -uve-19:0
Connect attempt to <BrokerConnection host=192.168.0.138 port=9092> returned error 111. Disconnecting.
Skipping unconnected connection: <BrokerConnection host=192.168.0.138 port=9092>
Connect attempt to <BrokerConnection host=192.168.0.137 port=9092> returned error 111. Disconnecting.
Skipping unconnected connection: <BrokerConnection host=192.168.0.137 port=9092>
Connect attempt to <BrokerConnection host=192.168.0.139 port=9092> returned error 111. Disconnecting.
Skipping unconnected connection: <BrokerConnection host=192.168.0.139 port=9092>
FailedPayloadsError for -uve-19:0
Connect attempt to <BrokerConnection host=192.168.0.139 port=9092> returned error 111. Disconnecting.
Skipping unconnected connection: <BrokerConnection host=192.168.0.139 port=9092>
Connect attempt to <BrokerConnection host=192.168.0.138 port=9092> returned error 111. Disconnecting.
Skipping unconnected connection: <BrokerConnection host=192.168.0.138 port=9092>
Connect attempt to <BrokerConnection host=192.168.0.137 port=9092> returned error 111. Disconnecting.
Skipping unconnected connection: <BrokerConnection host=192.168.0.137 port=9092>
FailedPayloadsError for -uve-19:0
Connect attempt to <BrokerConnection host=192.168.0.137 port=9092> returned error 111. Disconnecting.
Skipping unconnected connection: <BrokerConnection host=192.168.0.137 port=9092>
Connect attempt to <BrokerConnection host=192.168.0.139 port=9092> returned error 111. Disconnecting.
Skipping unconnected connection: <BrokerConnection host=192.168.0.139 port=9092>
Connect attempt to <BrokerConnection host=192.168.0.138 port=9092> returned error 111. Disconnecting.
Skipping unconnected connection: <BrokerConnection host=192.168.0.138 port=9092>
FailedPayloadsError for -uve-19:0
Connect attempt to <BrokerConnection host=192.168.0.138 port=9092> returned error 111. Disconnecting.
Skipping unconnected connection: <BrokerConnection host=192.168.0.138 port=9092>
Connect attempt to <BrokerConnection host=192.168.0.139 port=9092> returned error 111. Disconnecting.
Skipping unconnected connection: <BrokerConnection host=192.168.0.139 port=9092>
Connect attempt to <BrokerConnection host=192.168.0.137 port=9092> returned error 111. Disconnecting.
Skipping unconnected connection: <BrokerConnection host=192.168.0.137 port=9092>
FailedPayloadsError for -uve-19:0

supervisord gets busy reading those messages and is not able to process the stop all command. This was confirmed by removing /var/log/contrail/contrail-alarm-gen-0-stdout.log and then observing that fab stop_collector moved on to the next node.

root@sv-35:~# echo > /var/log/contrail/contrail-alarm-gen-0-stdout.log
root@sv-35:~# ls -lart /var/log/contrail/contrail-alarm-gen-0-stdout.log
-rw-r--r-- 1 root root 18783175 Aug  3 14:24 /var/log/contrail/contrail-alarm-gen-0-stdout.log
root@sv-35:~# date
Thu Aug  3 14:24:42 JST 2017
root@sv-35:~# ls -lart /var/log/contrail/contrail-alarm-gen-0-stdout.log
-rw-r--r-- 1 root root 21450583 Aug  3 14:24 /var/log/contrail/contrail-alarm-gen-0-stdout.log
root@sv-35:~# ls -lart /var/log/contrail/contrail-alarm-gen-0-stdout.log
-rw-r--r-- 1 root root 22888661 Aug  3 14:24 /var/log/contrail/contrail-alarm-gen-0-stdout.log
root@sv-35:~# ls -lart /var/log/contrail/contrail-alarm-gen-0-stdout.log
-rw-r--r-- 1 root root 23655489 Aug  3 14:24 /var/log/contrail/contrail-alarm-gen-0-stdout.log
root@sv-35:~# ls -lart /var/log/contrail/contrail-alarm-gen-0-stdout.log
-rw-r--r-- 1 root root 24355706 Aug  3 14:24 /var/log/contrail/contrail-alarm-gen-0-stdout.log
root@sv-35:~# rm -f /var/log/contrail/contrail-alarm-gen-0-stdout.log
root@sv-35:~# ls -lart /var/log/contrail/contrail-alarm-gen-0-stdout.log
ls: cannot access /var/log/contrail/contrail-alarm-gen-0-stdout.log: No such file or directory
root@sv-35:~# ls -lart /var/log/contrail/contrail-alarm-gen-0-stdout.log
ls: cannot access /var/log/contrail/contrail-alarm-gen-0-stdout.log: No such file or directory
root@sv-35:~# ls -lart /var/log/contrail/contrail-alarm-gen-0-stdout.log
ls: cannot access /var/log/contrail/contrail-alarm-gen-0-stdout.log: No such file or directory
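
The repeated `ls` checks in the session above can be scripted. This is a minimal sketch, assuming Python is available on the node; the function name `growth_rate` and its parameters are hypothetical and not part of any Contrail tooling. It samples the file size at a fixed interval and reports the average growth.

```python
import os
import sys
import time


def growth_rate(path, interval=1.0, samples=5,
                sleep=time.sleep, getsize=os.path.getsize):
    """Return the average growth of `path` in bytes per second.

    Hypothetical helper for illustration. `sleep` and `getsize` are
    injectable so the function can be exercised without a real file.
    """
    if samples < 2:
        raise ValueError("at least two samples are needed")
    if interval <= 0:
        raise ValueError("interval must be positive")
    sizes = []
    for i in range(samples):
        sizes.append(getsize(path))
        # No need to wait after the last sample.
        if i < samples - 1:
            sleep(interval)
    elapsed = interval * (samples - 1)
    return (sizes[-1] - sizes[0]) / elapsed


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python growth.py /var/log/contrail/contrail-alarm-gen-0-stdout.log
    print("%.0f bytes/second" % growth_rate(sys.argv[1]))
```

A steadily positive rate while the stop command is pending would be consistent with the stdout flood described in this comment; it does not by itself prove that supervisord is blocked.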

[root@10.0.0.134] Executing task 'stop_collector'
2017-08-03 13:48:37:485017: [root@10.0.0.134] sudo: service supervisor-analytics stop
2017-08-03 13:48:37:485274: [root@10.0.0.134] out: supervisor-analytics stop/waiting
2017-08-03 13:49:22:778913: [root@10.0.0.134] out: 
2017-08-03 13:49:22:779408: 
2017-08-03 13:49:22:779728: [root@10.0.0.135] Executing task 'stop_collector'
2017-08-03 13:49:22:780124: [root@10.0.0.135] sudo: service supervisor-analytics stop
2017-08-03 13:49:22:780520: [root@10.0.0.135] out: 
2017-08-03 14:11:51:352998: [root@10.0.0.135] out: 
2017-08-03 14:11:51:785694: [root@10.0.0.135] out: 
2017-08-03 14:22:11:156542: [root@10.0.0.135] out: 
2017-08-03 14:22:11:357726: [root@10.0.0.135] out: 
2017-08-03 14:22:11:522193: [root@10.0.0.135] out: 
2017-08-03 14:25:13:052065: [root@10.0.0.135] out: 
2017-08-03 14:25:13:252827: [root@10.0.0.135] out: 
2017-08-03 14:25:13:653796: [root@10.0.0.135] out: 
2017-08-03 14:25:13:955165: [root@10.0.0.135] out: supervisor-analytics stop/waiting
2017-08-03 14:26:04:152905: [root@10.0.0.135] out: 
2017-08-03 14:26:04:153275: 
2017-08-03 14:26:04:153553: [root@10.0.0.136] Executing task 'stop_collector'

OpenContrail Admin (ci-admin-f) wrote on 2017-08-09: [Review update] R3.1

#3

Review in progress for https://review.opencontrail.org/34412
Submitter: Megh Bhatt (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote on 2017-08-11: A change has been merged

#4

Reviewed: https://review.opencontrail.org/34412
Committed: http://github.com/Juniper/contrail-sandesh/commit/640747f4e16d435dee9f89a2281f9353dcdd8afb
Submitter: Zuul (<email address hidden>)
Branch: R3.1

commit 640747f4e16d435dee9f89a2281f9353dcdd8afb
Author: Megh Bhatt <email address hidden>
Date: Wed Aug 9 13:31:04 2017 -0700

Add SandeshLogger API to configure user specified logger

Add a SandeshLogger API - set_logger_params() that allows to configure
user specified logger as per the logging parameters similar to how
sandesh logger is configured. contrail-alarm-gen will use the API
to configure kafka and kazoo loggers.

Change-Id: I33ddf504010b27e9ac5726ff596608e5f8f05f68
Partial-Bug: #1706832
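
The idea in the commit message above, configuring a caller-specified logger with the same parameters as the application's own logger, can be sketched with Python's standard `logging` module alone. The helper name `configure_library_loggers`, its parameters, and the log file path are hypothetical; the actual signature of `set_logger_params()` is not shown in this thread and may differ.

```python
import logging
import logging.handlers
import os
import tempfile


def configure_library_loggers(names, log_file, level=logging.INFO,
                              max_bytes=1024 * 1024, backup_count=5):
    """Send the named loggers to one rotating log file.

    Hypothetical helper for illustration; not the contrail-sandesh API.
    """
    handler = logging.handlers.RotatingFileHandler(
        log_file, maxBytes=max_bytes, backupCount=backup_count)
    handler.setFormatter(logging.Formatter(
        "%(asctime)s [%(name)s] %(levelname)s: %(message)s"))
    for name in names:
        logger = logging.getLogger(name)
        logger.setLevel(level)
        # Remove handlers left by earlier configuration to avoid duplicates.
        for old in list(logger.handlers):
            logger.removeHandler(old)
        logger.addHandler(handler)
        # Keep records away from root handlers such as a stdout stream.
        logger.propagate = False
    return handler


# Assumed log location for the example; a real service would use its
# configured log file.
log_path = os.path.join(tempfile.mkdtemp(), "alarm-gen-example.log")
file_handler = configure_library_loggers(
    ("kafka", "kazoo"), log_path, level=logging.WARNING)

# Child loggers such as "kafka.conn" inherit the handler via "kafka".
logging.getLogger("kafka.conn").warning("Connect attempt returned error 111")
```

Sharing one handler between the loggers avoids two rotating handlers competing over the same file.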

Sandeep Sridhar (ssandeep) wrote on 2017-08-11:

#5

Megh - Which build in R3.1 Branch has this fix?

Greetings,
Sandeep.

OpenContrail Admin (ci-admin-f) wrote on 2017-08-11: [Review update] R3.1

#6

Review in progress for https://review.opencontrail.org/34490
Submitter: Megh Bhatt (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote on 2017-08-11: [Review update] R3.2

#8

Review in progress for https://review.opencontrail.org/34491
Submitter: Megh Bhatt (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote on 2017-08-11: [Review update] R4.0

#9

Review in progress for https://review.opencontrail.org/34492
Submitter: Megh Bhatt (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote on 2017-08-11: [Review update] master

#10

Review in progress for https://review.opencontrail.org/34493
Submitter: Megh Bhatt (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote on 2017-08-12: A change has been merged

#11

Reviewed: https://review.opencontrail.org/34491
Committed: http://github.com/Juniper/contrail-sandesh/commit/06315e74ee87463c8f09f91096e618faff631b79
Submitter: Zuul (<email address hidden>)
Branch: R3.2

commit 06315e74ee87463c8f09f91096e618faff631b79
Author: Megh Bhatt <email address hidden>
Date: Wed Aug 9 13:31:04 2017 -0700

Add SandeshLogger API to configure user specified logger

Add a SandeshLogger API - set_logger_params() that allows to configure
user specified logger as per the logging parameters similar to how
sandesh logger is configured. contrail-alarm-gen will use the API
to configure kafka and kazoo loggers.

Change-Id: I33ddf504010b27e9ac5726ff596608e5f8f05f68
Partial-Bug: #1706832
(cherry picked from commit 640747f4e16d435dee9f89a2281f9353dcdd8afb)

OpenContrail Admin (ci-admin-f) wrote on 2017-08-12:

#12

Reviewed: https://review.opencontrail.org/34493
Committed: http://github.com/Juniper/contrail-sandesh/commit/6e6ca0bc59722320ae48e9ecb83fa9ec756c0794
Submitter: Zuul (<email address hidden>)
Branch: master

commit 6e6ca0bc59722320ae48e9ecb83fa9ec756c0794
Author: Megh Bhatt <email address hidden>
Date: Wed Aug 9 13:31:04 2017 -0700

Add SandeshLogger API to configure user specified logger

Add a SandeshLogger API - set_logger_params() that allows to configure
user specified logger as per the logging parameters similar to how
sandesh logger is configured. contrail-alarm-gen will use the API
to configure kafka and kazoo loggers.

Change-Id: I33ddf504010b27e9ac5726ff596608e5f8f05f68
Partial-Bug: #1706832
(cherry picked from commit 640747f4e16d435dee9f89a2281f9353dcdd8afb)

OpenContrail Admin (ci-admin-f) wrote on 2017-08-15: [Review update] R3.2

#13

Review in progress for https://review.opencontrail.org/34561
Submitter: Megh Bhatt (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote on 2017-08-15: [Review update] R4.0

#15

Review in progress for https://review.opencontrail.org/34562
Submitter: Megh Bhatt (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote on 2017-08-15: [Review update] master

#17

Review in progress for https://review.opencontrail.org/34563
Submitter: Megh Bhatt (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote on 2017-08-15: A change has been merged

#19

Reviewed: https://review.opencontrail.org/34490
Committed: http://github.com/Juniper/contrail-controller/commit/d235608dd6b5084a20b2a80f1ed32286e3e20263
Submitter: Zuul (<email address hidden>)
Branch: R3.1

commit d235608dd6b5084a20b2a80f1ed32286e3e20263
Author: Megh Bhatt <email address hidden>
Date: Fri Aug 11 11:14:35 2017 -0700

Configure kafka and kazoo loggers in contrail-alarm-gen

When contrail-alarm-gen disconnects from kafka, it retries
to reconnect to it and spits out bunch of messages. The kafka
python library logger was configured to output those messages
to stdout and due to this supervisor-analytics would get very
busy and not be able to handle other commands like stop all
and hence service supervisor-analytics stop would hang. Fix is
to configure the kafka and kazoo loggers in contrail-alarm-gen
as per the logging parameters and log to the file instead.

Change-Id: I553c53b3b549269a8cc0a87feda137980f008643
Closes-Bug: #1706832
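
The thread does not show how the kafka library logger ended up writing to stdout. One common way this happens in Python is propagation: a library logger with no handler of its own passes records up to a root handler attached to the console. The sketch below illustrates that mechanism and the effect of giving the library logger its own handler. The logger name `example_lib` is a stand-in, and `io.StringIO` objects stand in for stdout and the log file so the result can be inspected.

```python
import io
import logging

fake_stdout = io.StringIO()   # stand-in for the process's stdout
fake_file = io.StringIO()     # stand-in for the service log file

# A root handler on "stdout", as a console-logging setup would create.
logging.getLogger().addHandler(logging.StreamHandler(fake_stdout))

parent = logging.getLogger("example_lib")       # hypothetical library
parent.setLevel(logging.WARNING)
lib_logger = logging.getLogger("example_lib.conn")

# 1. Unconfigured library logger: the record propagates to the root handler.
lib_logger.warning("reconnect attempt 1")

# 2. Give the library its own handler and stop propagation.
parent.addHandler(logging.StreamHandler(fake_file))
parent.propagate = False
lib_logger.warning("reconnect attempt 2")

print(repr(fake_stdout.getvalue()))  # 'reconnect attempt 1\n'
print(repr(fake_file.getvalue()))    # 'reconnect attempt 2\n'
```

Only the first message reaches the stdout stand-in; once the library logger has its own handler and `propagate` is off, retry messages go to the file stand-in instead.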

OpenContrail Admin (ci-admin-f) wrote on 2017-08-15:

#21

Reviewed: https://review.opencontrail.org/34492
Committed: http://github.com/Juniper/contrail-sandesh/commit/21fca210c72f7f62511bf4caaed78c070a5d4879
Submitter: Zuul (<email address hidden>)
Branch: R4.0

commit 21fca210c72f7f62511bf4caaed78c070a5d4879
Author: Megh Bhatt <email address hidden>
Date: Wed Aug 9 13:31:04 2017 -0700

Add SandeshLogger API to configure user specified logger

Add a SandeshLogger API - set_logger_params() that allows to configure
user specified logger as per the logging parameters similar to how
sandesh logger is configured. contrail-alarm-gen will use the API
to configure kafka and kazoo loggers.

Change-Id: I33ddf504010b27e9ac5726ff596608e5f8f05f68
Partial-Bug: #1706832
(cherry picked from commit 640747f4e16d435dee9f89a2281f9353dcdd8afb)

OpenContrail Admin (ci-admin-f) wrote on 2017-08-16:

#22

Reviewed: https://review.opencontrail.org/34563
Committed: http://github.com/Juniper/contrail-controller/commit/e2571c1e6d2dc1f16d3f51cac0b2d48a59edf8ae
Submitter: Zuul (<email address hidden>)
Branch: master

commit e2571c1e6d2dc1f16d3f51cac0b2d48a59edf8ae
Author: Megh Bhatt <email address hidden>
Date: Fri Aug 11 11:14:35 2017 -0700

Configure kafka and kazoo loggers in contrail-alarm-gen

When contrail-alarm-gen disconnects from kafka, it retries
to reconnect to it and spits out bunch of messages. The kafka
python library logger was configured to output those messages
to stdout and due to this supervisor-analytics would get very
busy and not be able to handle other commands like stop all
and hence service supervisor-analytics stop would hang. Fix is
to configure the kafka and kazoo loggers in contrail-alarm-gen
as per the logging parameters and log to the file instead.

Change-Id: I553c53b3b549269a8cc0a87feda137980f008643
Closes-Bug: #1706832

OpenContrail Admin (ci-admin-f) wrote on 2017-08-18:

#23

Reviewed: https://review.opencontrail.org/34562
Committed: http://github.com/Juniper/contrail-controller/commit/6dfd78fca3b1c105b31de8b8cbaefada29d9f042
Submitter: Zuul (<email address hidden>)
Branch: R4.0

commit 6dfd78fca3b1c105b31de8b8cbaefada29d9f042
Author: Megh Bhatt <email address hidden>
Date: Fri Aug 11 11:14:35 2017 -0700

Configure kafka and kazoo loggers in contrail-alarm-gen

When contrail-alarm-gen disconnects from kafka, it retries
to reconnect to it and spits out bunch of messages. The kafka
python library logger was configured to output those messages
to stdout and due to this supervisor-analytics would get very
busy and not be able to handle other commands like stop all
and hence service supervisor-analytics stop would hang. Fix is
to configure the kafka and kazoo loggers in contrail-alarm-gen
as per the logging parameters and log to the file instead.

Change-Id: I553c53b3b549269a8cc0a87feda137980f008643
Closes-Bug: #1706832

OpenContrail Admin (ci-admin-f) wrote on 2017-08-21:

#24

Reviewed: https://review.opencontrail.org/34561
Committed: http://github.com/Juniper/contrail-controller/commit/01c786e94e5f3ccb360b8d47bd3b3a58a385aec1
Submitter: Zuul (<email address hidden>)
Branch: R3.2

commit 01c786e94e5f3ccb360b8d47bd3b3a58a385aec1
Author: Megh Bhatt <email address hidden>
Date: Fri Aug 11 11:14:35 2017 -0700

Configure kafka and kazoo loggers in contrail-alarm-gen

When contrail-alarm-gen disconnects from kafka, it retries
to reconnect to it and spits out bunch of messages. The kafka
python library logger was configured to output those messages
to stdout and due to this supervisor-analytics would get very
busy and not be able to handle other commands like stop all
and hence service supervisor-analytics stop would hang. Fix is
to configure the kafka and kazoo loggers in contrail-alarm-gen
as per the logging parameters and log to the file instead.

Change-Id: I553c53b3b549269a8cc0a87feda137980f008643
Closes-Bug: #1706832

Affects            Status                   Importance  Assigned to  Milestone
Juniper Openstack  Status tracked in Trunk
R3.1               Fix Committed            High        Megh Bhatt   Juniper Openstack r3.1.4.0
R3.2               Fix Committed            High        Megh Bhatt   Juniper Openstack r3.2.5.0
R4.0               Fix Committed            High        Megh Bhatt   Juniper Openstack r4.0.1.0 "r4.0.1.0"
Trunk              Fix Committed            High        Megh Bhatt   Juniper Openstack r4.1.0.0-fcs "r4.1"
