ceilometer collector can not reconnect to rabbitmq after restarting rabbitmq

Bug #1393708 reported by Ai Jie Niu
This bug affects 2 people
Affects         Status   Importance  Assigned to  Milestone
Ceilometer      Invalid  Undecided   Unassigned
oslo.messaging  Invalid  Undecided   Unassigned

Bug Description

collector.log keeps printing errors like the following:
2014-11-18 03:45:49.698 30888 INFO oslo.messaging._drivers.impl_rabbit [-] Delaying reconnect for 1.0 seconds...
2014-11-18 03:45:50.700 30888 INFO oslo.messaging._drivers.impl_rabbit [-] Connecting to AMQP server on 10.11.1.15:5671
2014-11-18 03:45:50.708 30888 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 10.11.1.15:5671 is unreachable: [Errno 8] _ssl.c:492: EOF occurred in violation of protocol. Trying again in 13 seconds.
2014-11-18 03:46:03.720 30888 INFO oslo.messaging._drivers.impl_rabbit [-] Delaying reconnect for 1.0 seconds...
2014-11-18 03:46:04.721 30888 INFO oslo.messaging._drivers.impl_rabbit [-] Connecting to AMQP server on 10.11.1.15:5671
2014-11-18 03:46:04.731 30888 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 10.11.1.15:5671 is unreachable: [Errno 8] _ssl.c:492: EOF occurred in violation of protocol. Trying again in 13 seconds.
2014-11-18 03:46:17.736 30888 INFO oslo.messaging._drivers.impl_rabbit [-] Delaying reconnect for 1.0 seconds...
2014-11-18 03:46:18.738 30888 INFO oslo.messaging._drivers.impl_rabbit [-] Connecting to AMQP server on 10.11.1.15:5671
2014-11-18 03:46:18.746 30888 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 10.11.1.15:5671 is unreachable: timed out. Trying again in 15 seconds.

Meanwhile, the RabbitMQ log reports:
=INFO REPORT==== 18-Nov-2014::03:46:50 ===
accepting AMQP connection <0.1433.0> (10.11.1.15:46977 -> 10.11.1.15:5671)

=ERROR REPORT==== 18-Nov-2014::03:46:55 ===
error on AMQP connection <0.1433.0>:
{ssl_upgrade_error,timeout}

=INFO REPORT==== 18-Nov-2014::03:47:08 ===
accepting AMQP connection <0.1527.0> (10.11.1.15:46979 -> 10.11.1.15:5671)

=ERROR REPORT==== 18-Nov-2014::03:47:13 ===
error on AMQP connection <0.1527.0>:
{ssl_upgrade_error,timeout}

=INFO REPORT==== 18-Nov-2014::03:47:26 ===
accepting AMQP connection <0.1541.0> (10.11.1.15:46980 -> 10.11.1.15:5671)

=ERROR REPORT==== 18-Nov-2014::03:47:31 ===
error on AMQP connection <0.1541.0>:
{ssl_upgrade_error,timeout}

It seems the connection between the Ceilometer collector and RabbitMQ is stuck in a loop: connected -> timeout -> connected -> timeout, and so on.

After restarting the Ceilometer collector service, the issue goes away.
Although the code that connects to RabbitMQ lives in oslo.messaging, I have not seen this issue in other components such as Nova and Neutron.
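
To make the loop described above easier to see in a long collector.log, a small script can tally the failed reconnect attempts by reason. This is only a minimal sketch: the function name `summarize_reconnects` is hypothetical, and the regular expression assumes the ERROR lines have exactly the format quoted in this report.

```python
import re
from collections import Counter

# Matches the ERROR lines quoted in this report, for example:
# "... AMQP server on 10.11.1.15:5671 is unreachable: timed out. Trying again in 15 seconds."
UNREACHABLE = re.compile(
    r"AMQP server on (?P<server>\S+) is unreachable: "
    r"(?P<reason>.*?)\. Trying again in (?P<delay>\d+) seconds\."
)


def summarize_reconnects(lines):
    """Count failed reconnect attempts per (server, reason) pair."""
    failures = Counter()
    for line in lines:
        match = UNREACHABLE.search(line)
        if match:
            failures[(match.group("server"), match.group("reason"))] += 1
    return failures
```

It can be fed any iterable of log lines, for instance an open file object for collector.log. A count that keeps growing for the same reason after RabbitMQ is back up points to the stuck reconnect loop rather than a brief outage.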

ZhiQiang Fan (aji-zqfan) wrote :

Do the other Ceilometer components have the same issue?

Ai Jie Niu (niuaj) wrote :

Hi @ZhiQiang, no, the other Ceilometer services can reconnect to the MQ successfully, which is very strange.
Sometimes the collector can reconnect successfully too, but in most cases it cannot.

Ai Jie Niu (niuaj) wrote :

We found that if executor='blocking' is set when creating the RPC server, this issue goes away:
def get_rpc_server(transport, topic, endpoint):
    """Return a configured oslo.messaging rpc server."""
    cfg.CONF.import_opt('host', 'ceilometer.service')
    target = oslo.messaging.Target(server=cfg.CONF.host, topic=topic)
    serializer = RequestContextSerializer(JsonPayloadSerializer())
    return oslo.messaging.get_rpc_server(transport, target,
                                         [endpoint], executor='blocking',
                                         serializer=serializer)

Does anyone know why Ceilometer uses eventlet as the executor while other components use blocking?

Ethan Lynn (ethanlynn) wrote :

Is there any update on this bug? I am hitting it in production.

Steps to reproduce:
1. Restart RabbitMQ.
2. Check collector.log; it shows a lot of oslo.messaging errors.

Only the collector has this problem; the other services can reconnect to RabbitMQ once it is back.

The only workaround is to restart the collector service; after a restart it connects to RabbitMQ successfully. But I don't think that is a final solution. The Ceilometer collector should reconnect to the message queue automatically.

gordon chung (chungg) wrote :

I tried using master and oslo.messaging 1.8.0, and the collector was able to reconnect fine.

Changed in ceilometer:
status: New → Incomplete
Ethan Lynn (ethanlynn) wrote :

Sorry, I forgot to say that my production environment is the Juno release.
My oslo.messaging version is 1.4.0.

Changed in ceilometer:
status: Incomplete → New
gordon chung (chungg)
affects: ceilometer → oslo.messaging
affects: oslo.messaging → ceilometer
gordon chung (chungg) wrote :

Ethan, this bug might be of interest: https://bugs.launchpad.net/oslo.messaging/+bug/1338732

Ethan Lynn (ethanlynn) wrote :

Thanks, gordon. It seems I need to patch oslo manually.
I will try.

Mehdi Abaakouk (sileht) wrote :
Changed in oslo.messaging:
status: New → Incomplete
gordon chung (chungg)
Changed in ceilometer:
status: New → Invalid
QingchuanHao (haoqingchuan-28) wrote :

This problem can be reproduced when rabbitmq-server and the client do not both use SSL, that is, when SSL is enabled on one side but not on the other.
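
One way to check for that kind of mismatch from the client host is to attempt a bare TLS handshake against the broker port and see what happens. The following is only a diagnostic sketch in Python 3 using the standard library; the function name `probe_tls` and its return strings are hypothetical, it is not part of oslo.messaging, and certificate verification is deliberately disabled because the only question is whether the port speaks TLS at all.

```python
import socket
import ssl


def probe_tls(host, port, timeout=5.0):
    """Try a TLS handshake against host:port and describe the outcome.

    Returns a string starting with "tls-ok", "tls-failed" or "tcp-failed".
    """
    context = ssl.create_default_context()
    # Diagnostic only: do not validate the certificate or hostname.
    # check_hostname must be disabled before verify_mode is relaxed.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    try:
        raw = socket.create_connection((host, port), timeout=timeout)
    except OSError as exc:
        # The TCP connection itself failed (refused, unreachable, timed out).
        return "tcp-failed: %s" % exc
    try:
        # The handshake runs inside wrap_socket() on a connected socket.
        with context.wrap_socket(raw) as tls:
            return "tls-ok (%s)" % tls.version()
    except (ssl.SSLError, OSError) as exc:
        # Handshake errors and timeouts end up here, e.g. an unexpected EOF.
        return "tls-failed: %s" % exc
    finally:
        raw.close()
```

For example, `probe_tls("10.11.1.15", 5671)` run from the collector host would show whether the port from the logs above completes a handshake. A "tls-failed" result against a port the client is configured to use with SSL would be consistent with the mismatch described here, though it does not by itself prove the cause.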

Changed in oslo.messaging:
status: Incomplete → Invalid