Comment 0 for bug 1800957

Sam Morrison (sorrison) wrote :

We have discovered an issue when upgrading our clouds from Ocata to Pike.

oslo.messaging versions
ocata: 5.17.1
pike: 5.30.0

python-amqp versions
ocata: 1.4.9
pike: 2.1.4
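
To confirm which versions a given host is actually running, something like the following could be used. This is a minimal sketch, assuming Python 3.8+ for `importlib.metadata` (the hosts in this report run Python 2.7, where `pip show` would be the equivalent); `installed_versions` is a hypothetical helper name, and `amqp` is the PyPI distribution name of python-amqp.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Distribution is not installed in this environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    found = installed_versions(["oslo.messaging", "amqp"])
    for name, version in found.items():
        print("%s: %s" % (name, version or "not installed"))
```

Running this in both the Ocata and Pike environments gives a quick side-by-side of the two libraries suspected here.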

On upgrading to pike we get several issues with neutron-dhcp-agent and nova-compute.

The error we see is:

2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent [req-79e8c605-055e-4354-b749-7dd7baabf864 - - - - -] Failed reporting state!: MessagingTimeout: Timed out waiting for a reply to message ID ae039d1695984addbfaaef032ce4fda3
2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent Traceback (most recent call last):
2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent File "/usr/lib/python2.7/dist-packages/neutron/agent/dhcp/agent.py", line 740, in _report_state
2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent ctx, self.agent_state, True)
2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent File "/usr/lib/python2.7/dist-packages/neutron/agent/rpc.py", line 92, in report_state
2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent return method(context, 'report_state', **kwargs)
2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/client.py", line 169, in call
2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent retry=self.retry)
2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent File "/usr/lib/python2.7/dist-packages/oslo_messaging/transport.py", line 123, in _send
2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent timeout=timeout, retry=retry)
2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 578, in send
2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent retry=retry)
2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 567, in _send
2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent result = self._waiter.wait(msg_id, timeout)
2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 459, in wait
2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent message = self.waiters.get(msg_id, timeout=timeout)
2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 347, in get
2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent 'to message ID %s' % msg_id)
2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent MessagingTimeout: Timed out waiting for a reply to message ID ae039d1695984addbfaaef032ce4fda3

Steps to reproduce are:

Start neutron-dhcp-agent with no networks being hosted on it.
Agent reporting is fine: I have stepped through this manually with pdb and triggered the agent report hundreds of times, every 1-2 seconds, and neutron-server always responds in ~1 second.

Now schedule a network onto the agent
Now the agent sync times out.
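
The manual timing described in the steps above could be scripted roughly as follows. This is only a sketch: `time_calls` is a hypothetical helper, and `call` stands in for whatever zero-argument callable triggers the agent's state report in your debugging session. It is not an existing neutron or oslo.messaging API.

```python
import time


def time_calls(call, count, interval=1.0, sleep=time.sleep, clock=time.monotonic):
    """Invoke call() `count` times, pausing `interval` seconds between calls.

    Returns a list of (elapsed_seconds, error) tuples, where error is None
    when the call returned normally and the raised exception otherwise.
    """
    results = []
    for i in range(count):
        start = clock()
        error = None
        try:
            call()
        except Exception as exc:  # record the failure instead of aborting the run
            error = exc
        results.append((clock() - start, error))
        if i + 1 < count:
            sleep(interval)
    return results
```

Comparing the elapsed times before and after a network is scheduled onto the agent would show whether replies slow down gradually or stop arriving altogether.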

I can see the agent's reply queue in RabbitMQ start to fill up with unacked messages, and the agent starts to produce the stack trace above consistently.
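
One way to watch for this condition is to parse the output of `rabbitmqctl list_queues` and pick out reply queues with unacked messages. The sketch below assumes it runs on the RabbitMQ host with permission to call `rabbitmqctl`, that the queues live in the default vhost, and that reply queues use oslo.messaging's `reply_` name prefix. The helper names `parse_queue_listing` and `stuck_reply_queues` are made up for illustration.

```python
import subprocess


def parse_queue_listing(text, prefix="reply_"):
    """Return {queue_name: (ready, unacked)} for queues starting with prefix.

    Expects tab-separated rows of: name, messages_ready, messages_unacknowledged.
    """
    queues = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) != 3:
            continue  # banner or blank line
        name, ready, unacked = fields
        if not (ready.isdigit() and unacked.isdigit()):
            continue  # column header row printed by some rabbitmqctl versions
        if name.startswith(prefix):
            queues[name] = (int(ready), int(unacked))
    return queues


def stuck_reply_queues(text, threshold=1):
    """Return only the reply queues with at least `threshold` unacked messages."""
    return {
        name: counts
        for name, counts in parse_queue_listing(text).items()
        if counts[1] >= threshold
    }


if __name__ == "__main__":
    listing = subprocess.check_output(
        ["rabbitmqctl", "-q", "list_queues",
         "name", "messages_ready", "messages_unacknowledged"],
        universal_newlines=True,
    )
    for name, (ready, unacked) in sorted(stuck_reply_queues(listing).items()):
        print("%s ready=%d unacked=%d" % (name, ready, unacked))
```

Running it periodically while scheduling a network onto the agent should show the unacked count on the agent's reply queue climbing as the timeouts begin.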

Removing the network and restarting the agent gets the agent reporting normally again.

If I do the same thing but without the RabbitMQ SSL port and SSL setting, everything works flawlessly.

We also see this behaviour with nova-compute. Something happens, then all messages get stuck in the unacked state and timeouts appear in the log.

I suspect this could be more to do with the python-amqp version but I'm not certain.
We've tried terminating SSL in RabbitMQ itself, using versions 3.6.5 and 3.6.10, and we've also tried putting an F5 load balancer in front to offload SSL, but to no avail.