Comment 15 for bug 1648242

Brian Stajkowski (brian-stajkowski) wrote : Re: Failure to retry update_ha_routers_states

So this is a very narrow window that is being hit:

https://github.com/openstack/neutron/blob/master/neutron/common/rpc.py#L127

For this to happen, the oslo.messaging driver has to fetch the connection inside the tiny window in which it still sees the socket as not timed out:

https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_drivers/amqpdriver.py#L419

The connection is validated and, if it is considered valid, the message is sent:

https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_drivers/impl_rabbit.py#L705

Ultimately, the call times out waiting for the reply, we get the traceback, and the timeout is handled by just killing the request:

https://github.com/openstack/neutron/blob/master/neutron/common/rpc.py#L128

Now, we could add an rpc_ha_retry_limit parameter and act on it to attempt another send. But the narrow window we are hitting explains why you sometimes see this and sometimes don't: it just depends on how quickly everything falls into place.
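
A rough sketch of what such a retry could look like, just to make the idea concrete. This is not neutron's actual code: rpc_ha_retry_limit is the hypothetical option proposed above, call_with_retry and make_flaky_sender are made-up helpers, and MessagingTimeout here is a local stand-in for oslo_messaging.MessagingTimeout so the snippet runs on its own.

```python
class MessagingTimeout(Exception):
    """Local stand-in for oslo_messaging.MessagingTimeout (assumption)."""


def call_with_retry(send, rpc_ha_retry_limit=3):
    """Call send(); on timeout, retry up to rpc_ha_retry_limit more times.

    rpc_ha_retry_limit is a hypothetical option, not an existing one.
    The last timeout is re-raised once the limit is exhausted.
    """
    retries = 0
    while True:
        try:
            return send()
        except MessagingTimeout:
            retries += 1
            if retries > rpc_ha_retry_limit:
                raise


def make_flaky_sender(failures):
    """Build a fake send() that times out `failures` times, then succeeds."""
    state = {"calls": 0}

    def send():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise MessagingTimeout("timed out waiting for a reply")
        return "ok"

    return send, state


# Two timeouts followed by a success fits within a limit of three retries.
send, state = make_flaky_sender(failures=2)
result = call_with_retry(send, rpc_ha_retry_limit=3)
```

A plain retry like this only papers over the window; it does not explain why the connection was considered valid in the first place.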

Some questions though:
If the host dies, how can the amqp client view the connection as valid and send anything if it can't establish a connection to the rabbit host? How is it able to accept a connection?

Is it possible that ensure_publishing isn't handling a disconnect well? Or that kombu isn't?

https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_drivers/impl_rabbit.py#L1156