tripleo

Comment 5 for bug 1637443

Christian Schwede (cschwede) wrote on 2016-10-31:

It looks like this only happens in HA setups, where pacemaker controls RabbitMQ.

If I use a single-controller OOO deployment, it doesn't happen; after restarting RabbitMQ, ceilometermiddleware sends all messages again.

But with a three-node controller setup controlled by pcs this doesn't work. Even after the controllers are up again, ceilometermiddleware (or, more exactly, oslo.messaging) fails to reconnect. The following message is repeated over and over:

Oct 31 10:04:00 host-192-0-2-15 proxy-server: STDERR: ERROR:oslo.messaging._drivers.impl_rabbit:[275d772c-b7d4-4010-b07e-01dd10c3b1a4] AMQP server on 172.17.0.16:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 14 seconds. Client port: 43564 (txn: txa0bd3ae6c911476aa0698-0058174f25) (client_ip: 172.18.0.18)

However, the server is up: connecting manually to 172.17.0.16:5672 (for example, using the telnet client) succeeds. Looking at other services, though, they connect to a different RabbitMQ instance after RabbitMQ was restarted.

So I think it makes sense to add the other RabbitMQ hosts to the list of nodes as well, so that the middleware is not tied to a single node.
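
As a rough illustration of what listing several hosts could look like: oslo.messaging transport URLs accept multiple comma-separated host entries, which lets the client fail over to another node. The helper below is a hypothetical sketch, not part of any patch; the function name, the credentials, and the two extra addresses are assumptions made for the example.

```python
from urllib.parse import quote


def build_transport_url(hosts, user, password, port=5672, vhost=""):
    """Join several RabbitMQ hosts into one rabbit:// transport URL.

    Hypothetical helper: the credentials are repeated for every host,
    which is the multi-host form used by oslo.messaging transport URLs.
    """
    if not hosts:
        raise ValueError("at least one RabbitMQ host is required")
    # Percent-encode credentials so characters like '@' or ':' stay safe.
    creds = "%s:%s" % (quote(user, safe=""), quote(password, safe=""))
    netlocs = ",".join("%s@%s:%d" % (creds, host, port) for host in hosts)
    return "rabbit://%s/%s" % (netlocs, vhost)


# Only 172.17.0.16 appears in the log message; the other two addresses
# and the credentials are made up for illustration.
url = build_transport_url(
    ["172.17.0.16", "172.17.0.17", "172.17.0.18"], "guest", "secret")
print(url)
```

The resulting string would go wherever the middleware reads its transport URL from; which option that is depends on the deployment.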

Another option is to use the nonblocking_notify option in ceilometermiddleware, which also fixes the problem.
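
To make the idea behind a non-blocking notify concrete, here is a minimal sketch of the general pattern: events are put on a bounded queue and sent from a background thread, so a broker that is unreachable does not stall the caller. This is not the ceilometermiddleware code; the class name, the `send` callable, and the queue size are assumptions for illustration only.

```python
import queue
import threading


class NonBlockingNotifier:
    """Hand events to a background thread so the caller never blocks."""

    def __init__(self, send, max_queue=1000):
        self._send = send  # callable that may block or raise
        self._queue = queue.Queue(maxsize=max_queue)
        self.dropped = 0
        worker = threading.Thread(target=self._run, daemon=True)
        worker.start()

    def notify(self, event):
        """Queue the event; drop it instead of waiting if the queue is full."""
        try:
            self._queue.put_nowait(event)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def _run(self):
        while True:
            event = self._queue.get()
            try:
                self._send(event)
            except Exception:
                pass  # a real implementation would log and possibly retry
            finally:
                self._queue.task_done()

    def flush(self):
        """Wait until every queued event has been handed to send()."""
        self._queue.join()


sent = []
notifier = NonBlockingNotifier(sent.append)
notifier.notify({"id": 1})
notifier.flush()
```

The trade-off in this sketch is that events are dropped once the queue is full, in exchange for the request path never waiting on the broker.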

I'm going to propose another upstream patch.