Comment 21 for bug 1289200

Dmitry Borodaenko (angdraug) wrote:

Below is a Release-Notes-friendly description of the current state of this bug, reflecting all fixes merged to master so far plus three additional fixes:
- https://review.openstack.org/93411 rabbitmq-keepalive
- https://review.openstack.org/93883 rabbitmq-hosts-shuffle
- https://bugs.launchpad.net/fuel/+bug/1321451 python-kombu-and-amqp-upgrade

Controller failover may cause Nova to fail to start VM instances 2014-05-20
---------------------------------------------------------------------------

If one of the Controller nodes abruptly goes offline, it is possible that some
of the TCP connections from Nova services on Compute nodes to RabbitMQ on the
failed Controller node will not be immediately terminated.
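The connections linger because nothing forces the kernel to notice a half-open TCP connection to a peer that vanished without sending a FIN or RST. The rabbitmq-keepalive fix linked above addresses this class of problem; as a rough illustration (not the actual Fuel patch, and with illustrative timeout values), per-socket TCP keepalive can be tuned so a dead peer is detected in minutes rather than hours:

```python
import socket


def make_keepalive_socket(idle=60, interval=10, count=5):
    """Create a TCP socket that probes an idle peer after `idle` seconds.

    Without SO_KEEPALIVE, a connection to a crashed peer can linger
    indefinitely; with keepalive enabled but untuned, the Linux default
    of net.ipv4.tcp_keepalive_time (7200 s, i.e. 2 hours) applies.
    The values here are illustrative, not Fuel's shipped defaults.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Linux-specific per-socket overrides of the kernel-wide defaults:
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
    return s
```

With these settings the kernel sends the first probe after `idle` seconds of silence and drops the connection after `count` unanswered probes, so the client library sees an error and can reconnect to another broker.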

When that happens, RPC communication between the nova-compute service and Nova
services on Controller nodes stops, and Nova becomes unable to manage VM
instances on the affected Compute nodes. Instances that were previously
launched on these nodes continue running but cannot be stopped or modified, and
new instances scheduled to the affected nodes will fail to launch.

After 2 hours (sooner if the failed Controller node is brought back online),
the zombie TCP connections are terminated; after that, Nova services on the
affected Compute nodes reconnect to RabbitMQ on one of the operational
Controller nodes and RPC communication is restored. Manually restarting the
nova-compute service on affected nodes also results in immediate recovery.
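The 2-hour figure matches the Linux default for `net.ipv4.tcp_keepalive_time` (7200 seconds). As an interim workaround on deployments that cannot yet apply the fixes above, the system-wide keepalive defaults can be shortened; a sketch with illustrative values (these are not Fuel's shipped settings, and they only affect sockets that enable `SO_KEEPALIVE`):

```ini
# /etc/sysctl.conf -- illustrative values, not Fuel defaults
net.ipv4.tcp_keepalive_time = 60     # first probe after 60 s idle (default 7200)
net.ipv4.tcp_keepalive_intvl = 10    # retry probes every 10 s (default 75)
net.ipv4.tcp_keepalive_probes = 5    # drop after 5 failed probes (default 9)
```

Apply with ``sysctl -p`` and restart the affected services so their connections pick up the new defaults.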

See `LP1289200 <https://bugs.launchpad.net/fuel/+bug/1289200>`_.