L3 agent hangs when rabbitmq cluster fails often

Bug #1448024 reported by Eugene Nikanorov
This bug affects 2 people
Affects: Mirantis OpenStack
Status: Invalid
Importance: Medium
Assigned to: Oleg Bondarev

Bug Description

The L3 agent hangs when the RabbitMQ cluster repeatedly falls apart and recovers.

The last lines seen in the L3 agent log:

2015-04-24 10:07:14.205 30273 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.16.10.4:5673 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 15 seconds.
2015-04-24 10:07:14.230 30273 INFO oslo.messaging._drivers.impl_rabbit [-] Connecting to AMQP server on 172.16.10.5:5673
2015-04-24 10:07:14.244 30273 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.16.10.5:5673 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 15 seconds.
2015-04-24 10:07:14.245 30273 INFO oslo.messaging._drivers.impl_rabbit [-] Connecting to AMQP server on 172.16.10.5:5673
2015-04-24 10:07:14.256 30273 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.16.10.5:5673 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 15 seconds.
2015-04-24 10:07:14.737 30273 INFO oslo.messaging._drivers.impl_rabbit [-] Connecting to AMQP server on 172.16.10.5:5673
2015-04-24 10:07:14.752 30273 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.16.10.5:5673 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 15 seconds.
2015-04-24 10:07:28.215 30273 DEBUG neutron.openstack.common.periodic_task [-] Running periodic task L3NATAgentWithStateReport.periodic_sync_routers_task run_periodic_tasks /usr/lib/python2.7/dist-packages/neutron/openstack/common/periodic_task.py:193
2015-04-24 10:07:28.216 30273 DEBUG neutron.agent.l3_agent [-] Starting _sync_routers_task - fullsync:False _sync_routers_task /usr/lib/python2.7/dist-packages/neutron/agent/l3_agent.py:1856
2015-04-24 10:07:28.906 30273 INFO oslo.messaging._drivers.impl_rabbit [-] Delaying reconnect for 1.0 seconds .

This could be an issue in oslo.messaging, somehow hanging on a reconnect attempt.

Further analysis shows that the process is hanging in:
epoll_wait(4, {}, 1023, 0) = 0
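
For reference, this strace pattern (epoll_wait called with a zero timeout and returning 0 over and over) is what a busy wait looks like: the event loop keeps polling without ever blocking. A minimal illustration of the same syscall pattern, not the agent's actual code:

import select

# A zero-timeout poll loop: with nothing ready, each call returns an empty
# event list immediately, so strace shows an endless stream of
# epoll_wait(fd, {}, N, 0) = 0 while the process pins a CPU core.
ep = select.epoll()
while True:
    events = ep.poll(0)   # timeout 0 -> never blocks
    if not events:
        continue          # nothing ready; spin again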

Tags: neutron
Changed in mos:
milestone: none → 6.1
description: updated
Changed in mos:
assignee: nobody → MOS Neutron (mos-neutron)
status: New → Confirmed
Revision history for this message
Eugene Nikanorov (enikanorov) wrote :

One important (but as yet unconfirmed) condition that may provide a hint here is that the rsyslog server is also constantly restarting for some reason.

Some time ago we fixed an issue with eventlet which made services hang (spin in a busy wait), consuming 100% CPU on rsyslog restart.
We may be facing the same issue here, so it is worth checking.
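
To check whether this deployment is still affected, a minimal repro sketch follows. Assumptions: eventlet monkey-patches sockets the way the agents do, and rsyslog has a TCP listener on 127.0.0.1:514 (adjust the address to the actual setup; the handler will fail to construct if nothing is listening). Run it, restart rsyslog, and watch the process with top/strace; a vulnerable eventlet starts spinning in epoll_wait at 100% CPU.

import eventlet
eventlet.monkey_patch()

import logging
import logging.handlers
import socket

SYSLOG_ADDR = ('127.0.0.1', 514)   # assumed local rsyslog TCP listener

log = logging.getLogger('rsyslog-restart-repro')
log.setLevel(logging.INFO)
log.addHandler(logging.handlers.SysLogHandler(address=SYSLOG_ADDR,
                                              socktype=socket.SOCK_STREAM))

while True:
    log.info('heartbeat')   # each record is written to the syslog TCP socket
    eventlet.sleep(1)       # yield to the eventlet hub between writes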

Revision history for this message
Eugene Nikanorov (enikanorov) wrote :

In comment #1: not the "same" issue, but a "similar" issue.

Changed in mos:
status: Confirmed → In Progress
Revision history for this message
Oleg Bondarev (obondarev) wrote :

Original zendesk ticket: https://mirantis.zendesk.com/agent/tickets/6189
Fixed by Sergey Yudin:
"I believe the problem was syslog, with corosync/pacemaker/rabbit/whatever affected by syslog.
I've changed the permissions on the log spooler dir - it was owned by root, which caused syslog to restart roughly every minute.
I've also removed logstash from the syslog config file because logstash was not reachable (no idea whether it even should be).

After that I restarted all services which may use rabbit."
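
A quick way to verify the spooler-directory ownership mentioned in that fix (the path and the expected owner are assumptions; adjust them to the deployment):

import os
import pwd

SPOOL_DIR = '/var/spool/rsyslog'   # assumed rsyslog work/spool directory

st = os.stat(SPOOL_DIR)
owner = pwd.getpwuid(st.st_uid).pw_name
print('%s is owned by %s' % (SPOOL_DIR, owner))
if owner == 'root':
    print('owned by root - rsyslog may keep failing and restarting; '
          'chown it to the syslog user')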

Though the neutron agents (oslo_messaging?) could behave better in cases where rsyslog doesn't work correctly (fail with helpful logs rather than "hang"), I'd consider lowering the importance to Medium.

Changed in mos:
importance: Critical → Medium
milestone: 6.1 → 7.0
status: In Progress → Confirmed
Changed in mos:
assignee: MOS Neutron (mos-neutron) → Oleg Bondarev (obondarev)
Revision history for this message
Oleg Bondarev (obondarev) wrote :
Changed in mos:
status: Confirmed → Invalid