Collector is unable to reconnect to the queue
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Ceilometer | Invalid | Undecided | Unassigned |
oslo.messaging | Fix Released | Undecided | Unassigned |
Bug Description
* Problem description
The collector frequently detects "Connection reset by peer" (errno 104) on its RabbitMQ queue connection. Once this happens, the collector tries to reconnect to the queue server, but it never succeeds and the reconnection attempts continue forever.
Please look at the log below.
Restarting the collector child process (worker) that hits the problem fixes the situation temporarily: the collector parent process restarts the worker, and the worker can receive messages from the queue for a while. However, the connection is soon reset again and the reconnection failures resume.
The timing of the connection reset is not stable: it may happen as soon as the collector (worker) starts receiving messages from the queue, or only after about 10 messages have been received. In any case, it does not take long.
For detailed steps to reproduce the problem, please see "Reproduce" below.
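To help isolate whether the reconnect loop comes from oslo.messaging or from the broker connection itself, a minimal stand-alone consumer can be pointed at the same broker. The sketch below uses kombu (the library underneath oslo.messaging's rabbit driver); the broker URL, exchange, and queue name are assumptions for illustration and should be adjusted to the actual deployment.
-----------
import socket
import time

from kombu import Connection, Exchange, Queue

# Assumed values for illustration only; adjust to the actual deployment.
BROKER_URL = 'amqp://guest:guest@10.0.1.12:5672//'
EXCHANGE = Exchange('ceilometer', type='topic')                     # assumed exchange
QUEUE = Queue('metering.sample', EXCHANGE, routing_key='metering')  # assumed queue

def on_message(body, message):
    print('received: %r' % (body,))
    message.ack()

while True:
    try:
        with Connection(BROKER_URL) as conn:
            with conn.Consumer(QUEUE, callbacks=[on_message]):
                while True:
                    conn.drain_events(timeout=60)
    except socket.timeout:
        # idle timeout with no messages; loop around and reconnect
        continue
    except (socket.error, IOError) as exc:
        # "Connection reset by peer" (errno 104) would surface here
        print('connection error: %s; reconnecting in 1 second' % exc)
        time.sleep(1)
-----------
If this consumer also loses its connection and fails to recover against the same broker, the problem is likely below oslo.messaging; if it keeps working, the reconnect logic in the driver is the more likely suspect.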
* Environment
OpenStack Juno-1 release.
OS is Ubuntu 13.10, RabbitMQ is 3.1.3-1, oslo.messaging is 1.3.0.
When testing with the OpenStack Icehouse-3 release, the problem does not happen (OS and RabbitMQ are the same as above; oslo.messaging is not used in i-3).
The network is a LAN that is dedicated for the test.
* Reproduce
There are two patterns to reproduce the problem.
- Pattern one:
1. Create 30 vm instances on a nova-compute
2. Start the Collector
3. Stop the Collector
4. Start the Compute Agent on the host that nova-compute is running
5. The Compute Agent puts some messages in the queue
6. Start the Collector
7. Watch the collector periodically receive messages from the agent.
After a while, the connection reset happens and the collector cannot reconnect anymore.
Both steps 3 and 5 seem to be critical.
If we do not stop the Collector at all, the problem does not happen. Also, if we omit steps 2 and 3, the problem does not happen. That is, connecting to the existing queue (created in step 2) seems to cause the problem.
If the Collector has been stopped but the queue is empty before the Collector is started again, the problem does not happen (a small script to check the queue depth before restarting is sketched after this pattern).
The same problem happens when we use the Central Agent instead of the Compute Agent.
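As a way to verify the precondition that the queue already exists and is non-empty before the Collector is restarted (step 6), the queue depth can be checked with a passive declare. This is a minimal sketch; the broker URL and queue name are assumptions and may differ in a real Ceilometer setup.
-----------
from kombu import Connection

# Assumed broker URL and queue name, for illustration only.
BROKER_URL = 'amqp://guest:guest@10.0.1.12:5672//'
QUEUE_NAME = 'metering.sample'

with Connection(BROKER_URL) as conn:
    channel = conn.channel()
    # passive=True reports on the existing queue without creating or modifying it
    name, message_count, consumer_count = channel.queue_declare(
        queue=QUEUE_NAME, passive=True)
    print('%s: %d messages, %d consumers' % (name, message_count, consumer_count))
-----------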
- Pattern two:
1. Create 30 VM instances on a nova-compute, some 1040 images on Glance, and 80 floating IPs
2. Start the Collector
3. Start the Compute Agent and Central Agent
4. After the agents have finished sending messages to the queue, stop and restart them.
This emulates a large amount of messages being queued continuously.
5. Watch the collector receive messages.
After a while, the connection reset happens and the collector cannot reconnect anymore.
The key factor in this case seems to be the large number of queued messages (step 4); a sketch for generating such a backlog follows this pattern.
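To emulate the backlog from step 4 without running the agents repeatedly, a large batch of notifications can be pushed through oslo.messaging directly while the collector is stopped. This is only a sketch: the transport URL, topic, and event type below are assumptions and should be matched to the configuration under test.
-----------
from oslo.config import cfg
from oslo import messaging

# Assumed transport URL and topic, for illustration only.
transport = messaging.get_transport(
    cfg.CONF, 'rabbit://guest:guest@10.0.1.12:5672/')
notifier = messaging.Notifier(transport,
                              publisher_id='flood-test',
                              driver='messaging',
                              topic='metering')

# Queue a few thousand messages while the collector is stopped, then
# restart the collector and watch whether the connection resets.
for i in range(5000):
    notifier.info({}, 'telemetry.flood_test', {'sequence': i})
-----------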
* Logs
- Collector log
-----------
2014-07-01 16:38:30.309 10691 INFO ceilometer.
2014-07-01 16:38:31.847 10692 INFO ceilometer.
2014-07-01 16:38:31.863 10692 INFO ceilometer.
2014-07-01 16:38:32.581 10691 INFO ceilometer.
2014-07-01 16:38:32.599 10691 ERROR oslo.messaging.
2014-07-01 16:38:33.601 10691 INFO oslo.messaging.
2014-07-01 16:38:33.602 10691 INFO oslo.messaging.
2014-07-01 16:38:33.698 10692 INFO ceilometer.
2014-07-01 16:38:33.715 10692 INFO ceilometer.
2014-07-01 16:38:33.776 10692 INFO ceilometer.
2014-07-01 16:38:33.793 10692 INFO ceilometer.
2014-07-01 16:38:34.631 10691 ERROR oslo.messaging.
-----------
- AMQP server log
(not from the same time as the collector log, but the messages are essentially the same)
-----------
=ERROR REPORT==== 30-Jun-
closing AMQP connection <0.15119.0> (10.0.1.11:39911 -> 10.0.1.12:5672):
{handshake_
=INFO REPORT==== 30-Jun-
accepting AMQP connection <0.15190.0> (10.0.1.11:39917 -> 10.0.1.12:5672)
=ERROR REPORT==== 30-Jun-
closing AMQP connection <0.15126.0> (10.0.1.11:39912 -> 10.0.1.12:5672):
{handshake_
=INFO REPORT==== 30-Jun-
accepting AMQP connection <0.15194.0> (10.0.1.11:39918 -> 10.0.1.12:5672)
=ERROR REPORT==== 30-Jun-
closing AMQP connection <0.15150.0> (10.0.1.11:39913 -> 10.0.1.12:5672):
{handshake_
=INFO REPORT==== 30-Jun-
accepting AMQP connection <0.15217.0> (10.0.1.11:39919 -> 10.0.1.12:5672)
=ERROR REPORT==== 30-Jun-
closing AMQP connection <0.15157.0> (10.0.1.11:39914 -> 10.0.1.12:5672):
{handshake_
-----------
no longer affects: oslo
Changed in ceilometer:
status: New → Invalid
Changed in oslo.messaging:
status: New → Fix Released
Might be worth trying oslo.messaging 1.4.0.0a3 just to make sure it's not an issue that has been fixed since. Please let us know if it helps - we would backport any fix from master to stable/icehouse.
Also, you could use something like tcpdump/wireshark to see if there's anything suspicious happening on the wire - perhaps connections being opened and closed far more than we'd expect?