So this is a very narrow window that is being hit:
https://github.com/openstack/neutron/blob/master/neutron/common/rpc.py#L127
When the oslo.messaging driver fetches the connection, it has to hit a tiny window in which it still views the socket as not timed out:
https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_drivers/amqpdriver.py#L419
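As a rough illustration (not the actual oslo.messaging code), the race can be thought of as a time-based freshness check: the connection is handed out as usable because its last successful activity was recent, even though the broker host may already be gone. The names here (`Connection`, `is_fresh`, `STALE_AFTER`) are hypothetical:

```python
import time

# Hypothetical simplification of the driver's "is this connection still
# good?" check -- NOT the real oslo.messaging implementation.
STALE_AFTER = 2.0  # seconds without activity before we call it stale


class Connection:
    def __init__(self):
        self.last_activity = time.monotonic()

    def is_fresh(self):
        # The window: this only looks at local bookkeeping. If the broker
        # host died a moment ago, last_activity is still recent and the
        # connection is handed out as valid anyway.
        return (time.monotonic() - self.last_activity) < STALE_AFTER


conn = Connection()
# Broker dies right here; locally nothing has changed yet.
assert conn.is_fresh()  # still looks usable -> the message gets sent
```

Nothing in this check touches the wire, which is why hitting the window depends purely on timing.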
The connection is validated and, if it's viewed as valid, the message is sent:
https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_drivers/impl_rabbit.py#L705
Ultimately we get the traceback: the call times out waiting for the reply, and the timeout is handled by simply killing the request:
https://github.com/openstack/neutron/blob/master/neutron/common/rpc.py#L128
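The reply-wait side behaves roughly like a blocking get with a deadline: the send went into a dead connection, so no reply ever arrives, the wait expires, and the request is abandoned. A stdlib sketch of that idea (the reply queue, `wait_for_reply`, and `MessagingTimeout` here mirror the shape of the flow, not the exact oslo.messaging internals):

```python
import queue


class MessagingTimeout(Exception):
    """Stand-in for the timeout exception seen in the traceback."""


def wait_for_reply(reply_queue, timeout):
    try:
        return reply_queue.get(timeout=timeout)
    except queue.Empty:
        # Nobody will ever answer: the request is killed with a timeout.
        raise MessagingTimeout("Timed out waiting for a reply")


replies = queue.Queue()  # no reply is ever put here
caught = None
try:
    wait_for_reply(replies, timeout=0.1)
except MessagingTimeout as exc:
    caught = exc
    print(exc)
```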
Now, we could add an rpc_ha_retry_limit option to attempt another send, basically one more parameter to act on. But the narrow window we are hitting explains why you sometimes see this and sometimes don't; it just depends on how quickly everything falls into place.
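Such a retry option could look something like this. `rpc_ha_retry_limit` is the hypothetical parameter proposed above, and `send_message` stands in for the driver's send-and-wait call; none of this exists in oslo.messaging today:

```python
class MessagingTimeout(Exception):
    """Stand-in for the timeout raised when no reply arrives."""


def call_with_retries(send_message, rpc_ha_retry_limit=3):
    """Hypothetical retry wrapper around a single RPC send.

    send_message() represents one send-and-wait-for-reply attempt; it
    raises MessagingTimeout when the reply never comes back.
    """
    last_exc = None
    for _attempt in range(rpc_ha_retry_limit):
        try:
            return send_message()
        except MessagingTimeout as exc:
            last_exc = exc  # hit the window; retry, ideally on a fresh connection
    raise last_exc


# Usage: a send that hits the window twice, then succeeds.
attempts = {"n": 0}

def flaky_send():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise MessagingTimeout("no reply")
    return "reply"

result = call_with_retries(flaky_send)
print(result)
```

The open question with any such knob is whether a retry actually lands on a new connection or just re-enters the same stale one.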
Some questions though:
If the host dies, how can the amqp client view the connection as valid and send anything if it can't establish a connection to the rabbit host? How is it able to accept a connection?
Is it possible that ensure_publishing isn't handling a disconnect well? Or kombu?
https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_drivers/impl_rabbit.py#L1156
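Part of the answer to the "how can the client view the connection as valid" question may simply be TCP: after the peer goes away, the first send() still succeeds locally (the kernel buffers it and returns OK), and the failure only surfaces on a later operation, once the RST comes back. A loopback demo of that behavior (not OpenStack code; a plain socket experiment):

```python
import socket
import time

# Demonstrates why a client can "successfully" send on a connection
# whose peer is already gone.
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)

client = socket.socket()
client.connect(server.getsockname())
conn, _ = server.accept()

conn.close()                      # the "broker host" goes away

client.sendall(b"ping")           # succeeds: nothing locally says the peer is dead
first_send_ok = True
time.sleep(0.2)                   # give the RST time to arrive

second_send_failed = False
try:
    client.sendall(b"ping")       # now the failure surfaces
except OSError as exc:
    second_send_failed = True
    print("second send failed:", exc)

client.close()
server.close()
```

If the driver's send happens in that first-send window, the message is silently lost, which matches the reply timeout we end up seeing.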