[R 2.20] Some VMs stuck in error state when nova-conductor connection to RabbitMQ gets torn down intermittently

Bug #1463070 reported by amit surana
This bug affects 1 person
Affects            Status     Importance  Assigned to  Milestone
Juniper Openstack  Won't Fix  Medium      Ranjeet R
R2.20              Won't Fix  Medium      Ranjeet R
R3.0               Won't Fix  Medium      Ranjeet R
R3.1               Won't Fix  Medium      Ranjeet R
Trunk              Won't Fix  Medium      Ranjeet R

Bug Description

When 32 VMs are launched simultaneously, 3-5 of them get stuck in error state. It looks like communication between nova-compute (on one of the affected compute nodes) and nova-conductor is broken. Ranjeet has looked at the setup.

Version 2.20 build 34.

Logs are here:

10.84.5.100:/cs-shared/bugs/1463070/

Tags: snat
amit surana (asurana-t)
description: updated
Jeba Paulaiyan (jebap)
tags: added: snat
information type: Proprietary → Public
Ranjeet R (rranjeet-n) wrote :

I see RabbitMQ reconnects, and because of them some of the VM spawns fail at different places. One way to work around this is to retry a VM spawn when it fails, which would allow the SNAT testing to go through.
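
The retry idea above can be sketched as follows. This is only a minimal illustration, not the test harness actually used: `SpawnError`, `spawn_with_retry`, and the `spawn_fn` callable are hypothetical names, and the sketch assumes the caller can detect that a VM landed in error state and raise an exception for it.

```python
import time


class SpawnError(Exception):
    """Hypothetical error raised when a VM ends up in error state."""


def spawn_with_retry(spawn_fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call spawn_fn until it succeeds or max_attempts is reached.

    spawn_fn is a hypothetical zero-argument callable that boots one VM
    and raises SpawnError if the VM lands in error state.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return spawn_fn()
        except SpawnError:
            if attempt == max_attempts:
                # Out of attempts: surface the last failure to the caller.
                raise
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            # This gives a torn-down RabbitMQ connection time to reconnect.
            sleep(base_delay * 2 ** (attempt - 1))
```

Backing off between attempts matters here because an immediate retry would likely hit the same broken connection; the delay values are arbitrary and would need tuning against the observed reconnect time.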

Ranjeet R (rranjeet-n) wrote :

After updating the system with the oslo.messaging patches from https://github.com/openstack/oslo.messaging/commit/858353394dd15101ebcb1c4b39e672bc4aa2bd86, I was able to launch 100 VMs without any issues. There were patching issues in the cluster, which have been fixed. Given that, I recommend removing the blocker status from this bug.

Ranjeet R (rranjeet-n) wrote :

At scale, we are seeing intermittent issues where the nova-conductor connection to RabbitMQ gets torn down and then reconnects. We will keep this bug open to track the nova-conductor reconnect issue.

summary: - [R 2.20] Some VMs stuck in error state when scaling SNAT setup
+ [R 2.20] Some VMs stuck in error state when nova-conductor connection to
+ RabbitMQ gets torn down intermittently
Sunil Bakhru (sbakhru) wrote :

Once the oslo.messaging patch is in, the main issue seems to be resolved. The intermittent nova-conductor reconnects in the scaled configuration are something we will monitor, but they do not gate the release.

OpenContrail Admin (ci-admin-f) wrote : [Bug update]

bug update...

Ranjeet R (rranjeet-n) wrote :

We have moved to a new model of RabbitMQ connectivity in which traffic no longer goes through HAProxy. The issue is no longer reproducible.

Sachin Bansal (sbansal)
Changed in juniperopenstack:
status: New → Won't Fix