Collector is unable to reconnect to the queue

Bug #1337715 reported by Rikimaru Honjo
This bug affects 4 people
Affects          Status        Importance  Assigned to  Milestone
Ceilometer       Invalid       Undecided   Unassigned
oslo.messaging   Fix Released  Undecided   Unassigned

Bug Description

* Problem description

The collector frequently detects "Connection reset by peer" (Errno 104) on the RabbitMQ queue connection. Once this happens, the collector tries to reconnect to the queue server. However, it never succeeds and the reconnection attempts continue forever.
Please look at the log below.

Restarting the affected collector child process (worker) solves the situation temporarily: the collector's parent process restarts the worker, and the worker can receive messages from the queue for a while. However, the connection is soon reset again and the reconnection failures start over.
The time at which the connection reset happens is not stable: it may happen as soon as the collector (worker) starts to receive messages from the queue, or only after some 10 messages have been received. Either way, it does not take very long to happen.

For detailed steps to reproduce the problem, please see "Reproduce" below.

* Environment

OpenStack Juno-1 release.
The OS is Ubuntu 13.10, RabbitMQ is 3.1.3-1, and oslo.messaging is 1.3.0.
When testing with the OpenStack Icehouse-3 release, the problem does not happen (OS and RabbitMQ are the same as above; oslo.messaging is not used in Icehouse-3).
The network is a LAN dedicated to the test.

* Reproduce

There are two patterns to reproduce the problem.

- Pattern one:

1. Create 30 VM instances on a nova-compute host
2. Start the Collector
3. Stop the Collector
4. Start the Compute Agent on the host where nova-compute is running
5. The Compute Agent puts some messages into the queue
6. Start the Collector
7. Watch the collector periodically receive messages from the agent.
  After a while, a connection reset happens and the collector cannot reconnect anymore.

Both steps 3 and 5 seem to be critical.
If we do not stop the Collector at all, the problem does not happen. Also, if we omit steps 2 and 3, the problem does not happen. That is, connecting to an existing queue (created in step 2) seems to cause the problem.
If we have stopped the Collector but the queue is empty when the Collector is started again, the problem does not happen either.
The same problem happens when we use the Central Agent instead of the Compute Agent.

- Pattern two:

1. Create 30 VM instances on a nova-compute host, some 1040 images in Glance, and 80 floating IPs
2. Start the Collector
3. Start the Compute Agent and the Central Agent
4. After the agents have finished sending messages to the queue, stop and restart them.
  This emulates a large amount of messages being continuously queued.
5. Watch the collector receive messages.
  After a while, a connection reset happens and the collector cannot reconnect anymore.

The point in this case seems to be the large amount of messages (step 4).

* Logs

- Collector log
-----------
2014-07-01 16:38:30.309 10691 INFO ceilometer.dispatcher.database [-] record_metering_data start 31
2014-07-01 16:38:31.847 10692 INFO ceilometer.dispatcher.database [-] record_metering_data end
2014-07-01 16:38:31.863 10692 INFO ceilometer.dispatcher.database [-] record_metering_data start 31
2014-07-01 16:38:32.581 10691 INFO ceilometer.dispatcher.database [-] record_metering_data end
2014-07-01 16:38:32.599 10691 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 10.0.1.22:5672 is unreachable: [Errno 104] Connection reset by peer. Trying again in 1 seconds.
2014-07-01 16:38:33.601 10691 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnecting to AMQP server on 10.0.1.22:5672
2014-07-01 16:38:33.602 10691 INFO oslo.messaging._drivers.impl_rabbit [-] Delaying reconnect for 1.0 seconds...
2014-07-01 16:38:33.698 10692 INFO ceilometer.dispatcher.database [-] record_metering_data end
2014-07-01 16:38:33.715 10692 INFO ceilometer.dispatcher.database [-] record_metering_data start 1
2014-07-01 16:38:33.776 10692 INFO ceilometer.dispatcher.database [-] record_metering_data end
2014-07-01 16:38:33.793 10692 INFO ceilometer.dispatcher.database [-] record_metering_data start 31
2014-07-01 16:38:34.631 10691 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 10.0.1.22:5672 is unreachable: [Errno 104] Connection reset by peer. Trying again in 1 seconds.
-----------

- AMQP server log
(not from the same time as the collector log, but the messages are essentially the same)
-----------
=ERROR REPORT==== 30-Jun-2014::11:03:54 ===
closing AMQP connection <0.15119.0> (10.0.1.11:39911 -> 10.0.1.12:5672):
{handshake_timeout,frame_header}

=INFO REPORT==== 30-Jun-2014::11:03:55 ===
accepting AMQP connection <0.15190.0> (10.0.1.11:39917 -> 10.0.1.12:5672)

=ERROR REPORT==== 30-Jun-2014::11:03:56 ===
closing AMQP connection <0.15126.0> (10.0.1.11:39912 -> 10.0.1.12:5672):
{handshake_timeout,handshake}

=INFO REPORT==== 30-Jun-2014::11:03:57 ===
accepting AMQP connection <0.15194.0> (10.0.1.11:39918 -> 10.0.1.12:5672)

=ERROR REPORT==== 30-Jun-2014::11:03:58 ===
closing AMQP connection <0.15150.0> (10.0.1.11:39913 -> 10.0.1.12:5672):
{handshake_timeout,handshake}

=INFO REPORT==== 30-Jun-2014::11:03:59 ===
accepting AMQP connection <0.15217.0> (10.0.1.11:39919 -> 10.0.1.12:5672)

=ERROR REPORT==== 30-Jun-2014::11:03:59 ===
closing AMQP connection <0.15157.0> (10.0.1.11:39914 -> 10.0.1.12:5672):
{handshake_timeout,handshake}
-----------

Ben Nemec (bnemec)
no longer affects: oslo
Revision history for this message
Mark McLoughlin (markmc) wrote :

Might be worth trying oslo.messaging 1.4.0.0a3 just to make sure it's not an issue that has since been fixed. Please let us know if it helps - we would backport any fix from master to stable/icehouse.

Also, you could use something like tcpdump/wireshark to see if there's anything suspicious happening on the wire - perhaps connections being opened and closed far more often than we'd expect?

Revision history for this message
Rikimaru Honjo (honjo-rikimaru-c6) wrote :

Hi Mark, thank you for your comment.

I tried 1.4.0.0a3. The latter half of the problem, the infinite reconnection loop, is resolved: by setting
rabbit_max_retries = 3, an exception is raised after 3 unsuccessful retries and a traceback is logged. Then the connection is established again.
However, the main problem, the frequent "Connection reset by peer", still remains.
I'd be glad for any suggestion about this.

As for packet capturing, I will give it a try.

Revision history for this message
Rikimaru Honjo (honjo-rikimaru-c6) wrote :

I have tried tcpdump. The result shows the following:

1. The Collector (AMQP client) tries to create a new connection
2. The TCP connection is established (the 3-way handshake is done)
3. 10 seconds pass without the RabbitMQ (AMQP) handshake completing
4. The AMQP server sends an RST packet, which leads to "Connection reset by peer"

In step 3, the AMQP client normally starts the handshake and completes it within 10 seconds. However, in the error cases, the client either does not start sending packets at all, or starts the handshake but does not complete it within 10 seconds.

Please look at the logs below (they show the pattern where the client starts the handshake but does not complete it).

The immediate cause of the problem seems to be RabbitMQ's handshake timeout. However, this value seems to be hard-coded as 10 seconds in the RabbitMQ source, so it is hard for us to modify.
Moreover, the Collector (AMQP client) does not seem to be heavily loaded: it is just doing its normal task of receiving messages from the queue and writing them into the DB.
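
(For reference, one rough way to observe this broker-side handshake timeout with a plain Python socket is sketched below. This is only an illustrative sketch, not the collector's code; the broker address 10.0.1.22:5672 is taken from the logs above.)
-----------
import socket
import time

# Open a TCP connection to RabbitMQ, then delay the AMQP handshake past the
# broker's hard-coded 10-second HANDSHAKE_TIMEOUT.
s = socket.create_connection(('10.0.1.22', 5672))
time.sleep(11)

try:
    # AMQP 0-9-1 protocol header that a client normally sends right away.
    s.sendall(b'AMQP\x00\x00\x09\x01')
    print(repr(s.recv(4096)))   # expect b'' or a connection reset here
except socket.error as exc:
    print('connection reset, as in the collector log: %s' % exc)
finally:
    s.close()
-----------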

Are there any workaround to this problem? Any suggestion is welcome.

Collector log:

2014-07-14 19:55:57.888 8613 DEBUG oslo.messaging._drivers.amqp [-] Pool creating new connection create /usr/local/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqp.py:71
2014-07-14 19:55:57.890 8613 INFO oslo.messaging._drivers.impl_rabbit [-] Connecting to AMQP server on 10.0.1.22:5672
[...]
2014-07-14 19:56:08.972 8613 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 10.0.1.22:5672 is unreachable: [Errno 104] Connection reset by peer. Trying again in 1 seconds.

Collector log (tcpdump):

19:55:57.905290 IP ceilo21.47803 > ceilo22.amqp: Flags [S], seq 1639030281, win 29200, options [mss 1460,
sackOK,TS val 116185220 ecr 0,nop,wscale 7], length 0
19:55:57.905488 IP ceilo22.amqp > ceilo21.47803: Flags [S.], seq 3761066390, ack 1639030282, win 28960, o
ptions [mss 1460,sackOK,TS val 116186277 ecr 116185220,nop,wscale 7], length 0
19:55:57.905523 IP ceilo21.47803 > ceilo22.amqp: Flags [.], ack 1, win 229, options [nop,nop,TS val 11618
5220 ecr 116186277], length 0
[...]
19:56:07.173335 IP ceilo21.47803 > ceilo22.amqp: Flags [P.], seq 1:9, ack 1, win 229, options [nop,nop,TS val 116187537 ecr 116186277], length 8
19:56:07.173523 IP ceilo22.amqp > ceilo21.47803: Flags [.], ack 9, win 227, options [nop,nop,TS val 116188594 ecr 116187537], length 0
19:56:07.173786 IP ceilo22.amqp > ceilo21.47803: Flags [P.], seq 1:336, ack 9, win 227, options [nop,nop,TS val 116188594 ecr 116187537], length 335
19:56:07.173797 IP ceilo21.47803 > ceilo22.amqp: Flags [.], ack 336, win 237, options [nop,nop,TS val 116187537 ecr 116188594], length 0
[...]
19:56:07.906991 IP ceilo22.amqp > ceilo21.47803: Flags [R.], seq 336, ack 9, win 227, options [nop,nop,TS
 val 116188777 ecr 116187537], length 0

RabbitMQ log:

=INFO REPORT==== 14-Jul-2014::19:55:57 ===
accepting AMQP connection <0.21001.73> (10.0.1.21:47803 -> 10.0.1.22:5672)
[...]
=ERROR REPORT==== 14-Jul-2014::19:56:07 ===
closing AMQP connection <0.21001.73> (10.0.1.21:47803 -> 10.0.1.22:5672):
{handshake_timeout,frame_header}

Revision history for this message
Rikimaru Honjo (honjo-rikimaru-c6) wrote :

[Found workarounds]
Two workarounds are found.

One is to modify the RabbitMQ source: change HANDSHAKE_TIMEOUT to a longer value. By changing it from 10 to 60 seconds, the error disappeared in our environment.
However, when I put heavy load on RabbitMQ as in pattern two of this bug report, the error still happens, though less frequently. Thus, changing HANDSHAKE_TIMEOUT may not be an ultimate fix, but it works as a workaround.
For those who are interested: HANDSHAKE_TIMEOUT is still hard-coded in the current release (3.3.4) of RabbitMQ, but in the master code a new config parameter, handshake_timeout, has been introduced.

The other workaround is to change rpc_thread_pool_size. By changing it from 64 (the default) to 5, the error disappeared in our environment. However, with the value set to 10, the error appeared again. Therefore, the appropriate value seems to depend on the environment.

[Question]
Thus, it turned out that workarounds exist, but I doubt whether these fix the primary cause.
Looking at the logs again, I frequently found the 'connection_closed_abruptly' warning in the RabbitMQ log, which means that the client closed the connection without following the proper AMQP shutdown protocol. Furthermore, from tcpdump, the client seems to behave as follows: open a connection, get a message from the queue, then close the connection. That is, it does not reuse a connection at all, but makes a new connection for every message read, which seems to be a large overhead.
Since I'm not familiar with oslo, I might be wrong. Can someone correct my understanding, or are things working 'right' this way?

Revision history for this message
Nejc Saje (nejc-saje) wrote :

Thank you for the extensive research, Rikimaru! The problem appears to be resolved in oslo.messaging 1.4.1.

I could reproduce it with oslo.messaging 1.3.0 by writing a script that pumps metering messages into the metering queue (https://gist.github.com/nsaje/70b60d4ac6876467d9cd). With the latest oslo.messaging, the problem does not appear.
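
(A minimal sketch of that kind of load generator is below, in case the gist is unavailable. This is not the actual script; the broker URL and queue name are assumptions to be adjusted for your environment.)
-----------
import kombu

# Assumed broker URL and queue name.
conn = kombu.Connection('amqp://guest:guest@10.0.1.22:5672//')
queue = kombu.Queue('metering.sample')

producer = conn.Producer()
for i in range(10000):
    # Publish a trivial payload; the point is only to build up a backlog
    # in the queue before the collector starts consuming.
    producer.publish({'counter_name': 'test', 'counter_volume': i},
                     routing_key=queue.name,
                     declare=[queue])
conn.release()
-----------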

Please re-open this bug if you still run into this issue with the latest Ceilometer Juno-3 and oslo.messaging 1.4.0.

Nejc Saje (nejc-saje)
Changed in ceilometer:
status: New → Invalid
Mehdi Abaakouk (sileht)
Changed in oslo.messaging:
status: New → Fix Released
Revision history for this message
QingchuanHao (haoqingchuan-28) wrote :

@Mehdi Abaakouk, how does changing the code from the self-designed reconnect to the kombu-provided reconnect fix the handshake-timeout problem?

Revision history for this message
QingchuanHao (haoqingchuan-28) wrote :

The problem occurs in this way.
The client reconnects to the rabbitmq-server when the rabbitmq-server is down, and once the reconnect delay exceeds 10 seconds, the client cannot connect to the rabbitmq-server even when it is reachable again.

Client part:
The client blocks on sock.connect for the delay time mentioned above, which completes the three-way handshake and should be followed closely by sending the protocol header.

Server part:
The server accepts the connection and, once the connection is established, starts a 10-second timer; the connection is then closed after 10 seconds due to the handshake timeout.

I don't know what blocked the connect. strace showed that after socket.connect, sleep (for the delay time) was called.

Revision history for this message
Mehdi Abaakouk (sileht) wrote :

Since oslo.messaging 1.5.0, we have removed our custom reconnection code and directly use the kombu API for that. Also, some kombu timeouts were previously set to None; now we set them to 1 second, like the kombu eventloop does, to ensure that every underlying socket.* call returns after 1 second instead of getting stuck forever.
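
(A rough sketch of that polling idea with plain kombu, not the oslo.messaging code itself; the broker URL and queue name are assumptions:)
-----------
import socket
import kombu

conn = kombu.Connection('amqp://guest:guest@10.0.1.22:5672//')
channel = conn.channel()
queue = kombu.Queue('metering.sample', channel=channel)

def on_message(body, message):
    message.ack()

with kombu.Consumer(channel, queues=[queue], callbacks=[on_message]):
    while True:
        try:
            # With a 1-second timeout the underlying socket calls return
            # regularly, so the consumer can notice broker failures and let
            # the reconnection logic run, instead of blocking forever on a
            # read with timeout=None.
            conn.drain_events(timeout=1)
        except socket.timeout:
            continue
-----------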

Revision history for this message
QingchuanHao (haoqingchuan-28) wrote :

That the greenthreads block each other seems to come from not using monkey_patch(), with the result that the original (blocking) time.sleep is used.
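
(A minimal illustration of the difference, just a sketch and not the ceilometer code:)
-----------
import eventlet
eventlet.monkey_patch()   # patches time, socket, etc. with green versions

import time

def worker(name):
    for _ in range(3):
        print('%s working' % name)
        # With monkey_patch() this sleep is cooperative and yields to other
        # greenthreads; without it, the original time.sleep blocks the whole
        # process, so e.g. an in-progress AMQP handshake cannot make progress.
        time.sleep(1)

pool = eventlet.GreenPool()
pool.spawn(worker, 'a')
pool.spawn(worker, 'b')
pool.waitall()
-----------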
