Neutron services losing connections to rabbitmq-server

Bug #1669456 reported by Francis Ginther on 2017-03-02
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Autopilot Log Analyser
High
Francis Ginther
OpenStack neutron-api charm
Undecided
Unassigned
OpenStack neutron-gateway charm
Undecided
Unassigned
OpenStack rabbitmq-server charm
Undecided
Unassigned

Bug Description

This was found with an automated test of Landscape Openstack Autopilot in our CI, [1].

It is not obvious to me what the actual problem is, but there is a lot that appears broken between rabbitmq-server and the neutron services. Several neutron services appear to have issues connecting and staying connected to rabbitmq. The rabbitmq services also report lots of closed connections due to missing heartbeats. I can't tell if rabbitmq-server is dropping connections due to being over-loaded or if the network itself is having trouble or something else.

I've seen this twice now in our automated testing, and will make the logs available for both if needed. I'll attach the neutron* and rabbitmq-server logs directly to this bug for the CI run mentioned in [1].

Versions:
LDS: 17.01~bzr10932+jenkins3510-2
JUJU: 1:2.1.0-0ubuntu1~16.04.3~juju1
MAAS: 2.1.3+bzr5573-0ubuntu1
OPENSTACK_RELEASE: newton
OBJECT: ceph
BLOCK: iscsi
rabbitmq-server charm: cs:xenial/rabbitmq-server-61
neutron-api charm: cs:xenial/neutron-api-249
neutron-gateway charm: cs:xenial/neutron-gateway-234

Charm configuration:
rabbitmq-server
 - min-cluster-size: 3
neutron-api
 - neutron-security-groups: true
 - flat-network-providers: physnet1
 - enable-l3ha: true
 - enable-dvr: false
 - l2-population: false
 - region: (set to the region name)
neutron-gateway
 - instance-mtu: 1454
 - bridge-mappings: physnet1:br-data
 - data-port: (set to list of connected nics)

The problem ultimately presents itself when landscape fails when trying to create the initial neutron networks and router despite trying multiple times. Eventually it gives up and fails the deployment:

[from landscape-server-0/var/log/landscape-server/job-handler.log]
Feb 28 23:11:30 job-handler-1 ERR RetryingCall for '_create_networks_and_router' failed last retry.
Feb 28 23:11:30 job-handler-1 ERR Failed to execute job: Missing alive/up 'neutron-dhcp-agent' and 'neutron-openvswitch-agent' on elkhart.#012Traceback (most recent call last):#012 File "/opt/canonical/landscape/canonical/landscape/model/activity/jobrunner.py", line 40, in run#012 yield self._run_activity(account_id, activity_id)#012 File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1126, in _inlineCallbacks#012 result = result.throwExceptionIntoGenerator(g)#012 File "/usr/lib/python2.7/dist-packages/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator#012 return g.throw(self.type, self.value, self.tb)#012 File "/opt/canonical/landscape/canonical/landscape/model/openstack/jobs/network.py", line 111, in _run_activity#012 tenant_name, auth)#012 File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1126, in _inlineCallbacks#012 result = result.throwExceptionIntoGenerator(g)#012 File "/usr/lib/python2.7/dist-packages/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator#012 return g.throw(self.type, self.value, self.tb)#012 File "/opt/canonical/landscape/canonical/landscape/model/openstack/jobs/network.py", line 157, in _create_networks_and_router#012 yield helper.check_neutron_ready()#012 File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1126, in _inlineCallbacks#012 result = result.throwExceptionIntoGenerator(g)#012 File "/usr/lib/python2.7/dist-packages/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator#012 return g.throw(self.type, self.value, self.tb)#012 File "/opt/canonical/landscape/canonical/landscape/model/openstack/jobs/network.py", line 245, in check_neutron_ready#012 yield self._check_neutron_agents_up()#012 File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1128, in _inlineCallbacks#012 result = g.send(result)#012 File "/opt/canonical/landscape/canonical/landscape/model/openstack/jobs/network.py", line 271, in _check_neutron_agents_up#012 raise NeutronAgentNotUp(message)#012NeutronAgentNotUp: Missing alive/up 'neutron-dhcp-agent' and 'neutron-openvswitch-agent' on elkhart.

Examples of neutron and rabbitmq services having problems:

[from landscape-0-inner-logs/neutron-gateway-0/var/log/syslog]
Feb 28 23:00:50 clipper neutron-metering-agent[363130]: 2017-02-28 23:00:50.136 363130 ERROR oslo.messaging._drivers.impl_rabbit [req-932b1afa-3c52-4673-902e-b23e53fd6640 - - - - -] [0aa16391-052c-4084-9bae-8e77fef9ca9c] AMQP server 10.96.65.63:5672 closed the connection. Check login credentials: Socket closed
Feb 28 23:00:51 clipper neutron-metering-agent[363130]: 2017-02-28 23:00:51.155 363130 INFO oslo.messaging._drivers.impl_rabbit [req-932b1afa-3c52-4673-902e-b23e53fd6640 - - - - -] [0aa16391-052c-4084-9bae-8e77fef9ca9c] Reconnected to AMQP server on 10.96.65.63:5672 via [amqp] clientwith port 52878.
Feb 28 23:00:58 clipper neutron-lbaasv2-agent[369886]: 2017-02-28 23:00:58.662 369886 ERROR oslo.messaging._drivers.impl_rabbit [-] [d77bd393-3578-4fb3-92b5-d46869a9e2c6] AMQP server on 10.96.65.40:5672 is unreachable: timed out. Trying again in 1 seconds. Client port: 60096
Feb 28 23:00:59 clipper neutron-lbaasv2-agent[369886]: 2017-02-28 23:00:59.681 369886 INFO oslo.messaging._drivers.impl_rabbit [-] [d77bd393-3578-4fb3-92b5-d46869a9e2c6] Reconnected to AMQP server on 10.96.65.40:5672 via [amqp] clientwith port 33076.
Feb 28 23:01:09 clipper neutron-dhcp-agent[370126]: 2017-02-28 23:01:09.833 370126 ERROR oslo.messaging._drivers.impl_rabbit [req-b0ed177d-c1d0-4eb2-9515-427d8b2af991 - - - - -] [ac59175f-5c33-40e5-bf58-fb098881a5b4] AMQP server on 10.96.65.63:5672 is unreachable: timed out. Trying again in 1 seconds. Client port: 51862
Feb 28 23:01:10 clipper neutron-dhcp-agent[370126]: 2017-02-28 23:01:10.867 370126 INFO oslo.messaging._drivers.impl_rabbit [req-b0ed177d-c1d0-4eb2-9515-427d8b2af991 - - - - -] [ac59175f-5c33-40e5-bf58-fb098881a5b4] Reconnected to AMQP server on 10.96.65.63:5672 via [amqp] clientwith port 53077.
Feb 28 23:01:10 clipper neutron-dhcp-agent[370126]: 2017-02-28 23:01:10.958 370126 WARNING oslo.service.loopingcall [req-b0ed177d-c1d0-4eb2-9515-427d8b2af991 - - - - -] Function 'neutron.agent.dhcp.agent.DhcpAgentWithStateReport._report_state' run outlasted interval by 3025.52 sec

[from <email address hidden>]
=ERROR REPORT==== 28-Feb-2017::23:13:55 ===
closing AMQP connection <0.29983.1> (10.96.66.74:50418 -> 10.96.65.40:5672):
Missed heartbeats from client, timeout: 60s

=INFO REPORT==== 28-Feb-2017::23:14:09 ===
accepting AMQP connection <0.30289.1> (10.96.65.44:46668 -> 10.96.65.40:5672)

=ERROR REPORT==== 28-Feb-2017::23:14:12 ===
closing AMQP connection <0.30008.1> (10.96.66.74:50942 -> 10.96.65.40:5672):
Missed heartbeats from client, timeout: 60s

[1] - https://ci.lscape.net/job/landscape-system-tests/5413/

Changed in landscape:
assignee: nobody → Francis Ginther (fginther)
milestone: none → 17.02
Changed in autopilot-log-analyser:
status: New → In Progress
importance: Undecided → High
assignee: nobody → Francis Ginther (fginther)
David Ames (thedac) wrote :

Francis,

Just fyi, I do not have access ci.landscape.net, but I did look at the attached logs.

Can we rule out high load issues that have been a problem in the past in this instance?

The logs seem to indicate cascading failure, including not being able to reach the rabbit hosts at all:

2017-02-28 21:58:42.312 108426 ERROR oslo.messaging._drivers.impl_rabbit [req-f7be21f4-f063-4bad-a15a-1fb70e6de6d0 - - - - -] [f3ef7b8b-322a-48d9-a27f-631e2bd0ca32] AMQP server on 10.96.65.40:5672 is unreachable: timed out. Trying again in 1 seconds. Client port: 57718

I suspect rabbit connection issues are the symptom and not the cause.

When you see this can you gather some memory/cpu load information on the hosts?
What does this hardware look like? How many cores? We have mentioned the worker-multiplier issue before.
Percona-cluster can at times be misconfiguration to take too much memory. Check on its health. We have also made some improvements to percona in next that help guarantee this does not occur.

Changed in charm-neutron-api:
status: New → Incomplete
Changed in charm-neutron-gateway:
status: New → Incomplete
Changed in charm-rabbitmq-server:
status: New → Incomplete
Changed in landscape:
status: New → Triaged
Francis Ginther (fginther) wrote :

David,

The hardware for this testing all have 4 cores and 16GB of RAM. We are smooshing multiple services onto each host. The logs do contain ps output under var/log/ps-fauxww.txt which can give a vague idea of what is going on. The base-machine units will have a complete view of all of the processes across all LXCs as well as a ps_mem.txt file which has an aggregate view of memory usage.

After filing this bug, we have reduced the worker-multiplier to 1. We will keep looking for this. I don't think any further action is needed on this right now.

Changed in autopilot-log-analyser:
status: In Progress → Fix Committed
Chad Smith (chad.smith) on 2017-03-16
Changed in landscape:
milestone: 17.02 → 17.03
Changed in landscape:
status: Triaged → Incomplete
no longer affects: landscape
Zhengwei Gao (multi-task) wrote :

Same issue. Is there a solution to this? @Francis

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers