It looks like this only happens in HA setups, where Pacemaker controls RabbitMQ.
With a single-controller OOO deployment it doesn't happen; after RabbitMQ restarts, ceilometermiddleware sends all messages again.
But with a three-node controller setup managed by pcs this doesn't work. Even after the controllers are up again, ceilometermiddleware (or, more precisely, oslo.messaging) won't reconnect successfully. The following message is repeated over and over:
Oct 31 10:04:00 host-192-0-2-15 proxy-server: STDERR: ERROR:oslo.messaging._drivers.impl_rabbit:[275d772c-b7d4-4010-b07e-01dd10c3b1a4] AMQP server on 172.17.0.16:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 14 seconds. Client port: 43564 (txn: txa0bd3ae6c911476aa0698-0058174f25) (client_ip: 172.18.0.18)
However, the server is up: connecting manually to 172.17.0.16:5672 (for example with a telnet client) succeeds. Looking at other services, though, they connect to a different RabbitMQ instance after RabbitMQ was restarted.
So I think it makes sense to add the other RabbitMQ hosts to the list of nodes as well.
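To illustrate what "adding the other hosts" could look like, here is a rough sketch that builds a multi-host transport URL in the comma-separated form oslo.messaging accepts. The helper name build_transport_url, the credentials, and the second and third host addresses are made up for the example; only 172.17.0.16:5672 comes from the log above.

```python
from urllib.parse import quote


def build_transport_url(hosts, user, password, port=5672, vhost=""):
    """Join several RabbitMQ hosts into one rabbit:// transport URL.

    Hypothetical helper for illustration; it is not part of oslo.messaging.
    """
    # Percent-encode credentials so special characters do not break the URL.
    creds = f"{quote(user, safe='')}:{quote(password, safe='')}"
    # Each host gets its own "user:pass@host:port" entry, separated by commas.
    netlocs = ",".join(f"{creds}@{host}:{port}" for host in hosts)
    return f"rabbit://{netlocs}/{vhost}"


# 172.17.0.16 is from the log; the other two addresses are assumed.
url = build_transport_url(
    ["172.17.0.16", "172.17.0.17", "172.17.0.18"], "guest", "secret"
)
print(url)
```

With all three nodes listed, the client has other hosts to try when one of them refuses connections, instead of retrying a single address.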
Another option is to use the nonblocking_notify option in ceilometermiddleware, which also fixes the problem.
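For readers unfamiliar with the idea behind a non-blocking notify, the sketch below shows the general pattern: the request path only enqueues the event, and a background thread does the actual send. This is not the ceilometermiddleware code; the class name NonBlockingNotifier, its methods, and the queue size are assumptions chosen for the example.

```python
import queue
import threading


class NonBlockingNotifier:
    """Hypothetical sketch: hand events to a worker thread via a bounded queue."""

    def __init__(self, send, maxsize=1000):
        self._send = send  # callable that actually delivers one event
        self._queue = queue.Queue(maxsize=maxsize)
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def notify(self, event):
        """Enqueue without waiting; return False if the queue is full."""
        try:
            self._queue.put_nowait(event)
            return True
        except queue.Full:
            return False  # event is dropped rather than stalling the caller

    def _drain(self):
        while True:
            event = self._queue.get()
            try:
                self._send(event)
            except Exception:
                pass  # a failed send must not kill the worker thread
            finally:
                self._queue.task_done()

    def flush(self):
        """Block until every queued event has been handed to send()."""
        self._queue.join()


sent = []
notifier = NonBlockingNotifier(sent.append)
notifier.notify("object.put")
notifier.flush()
```

The trade-off in this pattern is that events can be dropped when the queue fills up, in exchange for the caller never waiting on an unreachable broker.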
I'm going to propose another upstream patch.