Upgrading to Pike causes RabbitMQ timeouts with SSL
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| oslo.messaging | Fix Released | High | Magnus Bergman | |
| oslo.messaging (Ubuntu) | Confirmed | Undecided | Unassigned | |
Bug Description
We have discovered an issue when upgrading our clouds from ocata to pike.
oslo.messaging versions
ocata: 5.17.1
pike: 5.30.0
python-amqp versions
ocata: 1.4.9
pike: 2.1.4
On upgrading to pike we get several issues with neutron-dhcp-agent and nova-compute.
The error we see is:
2018-11-01 10:05:48.580 7908 ERROR neutron. [19 traceback lines truncated after "neutron." in the extracted report]
Steps to reproduce:
1. Start neutron-dhcp-agent with no networks hosted on it. Agent reporting is fine; I have manually stepped through this with pdb and triggered the agent report hundreds of times, once every 1-2 seconds, and neutron-server always responds in ~1 second.
2. Schedule a network onto the agent.
3. The agent sync now times out. I can see the reply queue in RabbitMQ start filling with unacked messages, and the agent consistently produces the stack trace above.
4. Removing the network and restarting the agent gets the agent reporting normally again.
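The unacked backlog described above can be confirmed on the broker side with `rabbitmqctl` (a standard RabbitMQ admin command; the `grep` pattern assumes the default oslo.messaging reply-queue naming, which is an assumption here):

```
# List queues with their unacknowledged message counts,
# filtering for oslo.messaging RPC reply queues
rabbitmqctl list_queues name messages_unacknowledged | grep reply
```

A steadily growing unacknowledged count on a reply queue while the agent logs timeouts matches the behaviour reported in this bug.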
If I do the same thing but point everything at the non-SSL RabbitMQ port instead, it works flawlessly.
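For reference, the SSL vs non-SSL behaviour is controlled by the port in `transport_url` together with the `ssl` option in the `[oslo_messaging_rabbit]` section; a minimal sketch, assuming the default RabbitMQ ports (5671 for TLS, 5672 for plain TCP) and placeholder credentials/hostnames:

```ini
# SSL setup that triggers the timeouts (ports/hosts are placeholders)
[DEFAULT]
transport_url = rabbit://openstack:secret@rabbit-host:5671/

[oslo_messaging_rabbit]
ssl = true

# Non-SSL setup that works flawlessly
# [DEFAULT]
# transport_url = rabbit://openstack:secret@rabbit-host:5672/
# [oslo_messaging_rabbit]
# ssl = false
```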
We also see this behaviour with nova-compute: something happens, then all messages get stuck unacknowledged and timeouts appear in the log.
I suspect this has more to do with the python-amqp version, but I'm not certain.
We've tried SSL terminated in RabbitMQ itself with versions 3.6.5 and 3.6.10, and we've also tried offloading SSL to an F5 load balancer in front, but to no avail.
description: updated
Changed in oslo.messaging:
  assignee: nobody → Ken Giusti (kgiusti)
Changed in oslo.messaging:
  status: Incomplete → Confirmed
Changed in oslo.messaging:
  importance: Undecided → High
I should also note that downgrading back to the ocata versions of everything fixes this.