Hi Brian Haley and Swaminathan Vasudevan, I reproduced the bug on the master branch with the following steps:
1. kill a dvr_snat l3 agent
2. create a DVR+HA router
3. start the dvr_snat l3 agent
4. the error logs are then output continuously
The reason is that when the l3 agent does a fullsync, it calls ensure_snat_cleanup for every router depending only on whether the agent is dvr_snat or not, since [1]. However, DVR+HA routers always have snat namespaces on dvr_snat agents, which they hold for keepalived. The cleanup call is therefore unexpected, and it causes the _process_updated_router method to catch an Exception every time and put the router back into the RouterProcessingQueue again and again.
[1] https://review.openstack.org/#/c/326729/
I have submitted a patch for this: https://review.openstack.org/434863
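To make the retry loop described above easier to picture, here is a small toy model in Python. It is not Neutron code: CleanupError, the dict-based routers, run_fullsync and the skip_cleanup_for_ha flag are all hypothetical names for illustration, and the model simply assumes that cleanup raises for any HA router. The guard shown is only one possible way to break the loop and is not necessarily what the patch does.

```python
from collections import deque


class CleanupError(Exception):
    """Hypothetical stand-in for whatever the real cleanup path raises."""


def ensure_snat_cleanup(router):
    # Toy assumption: cleanup fails for HA routers because their snat
    # namespace is still held for keepalived.
    if router["ha"]:
        raise CleanupError("snat namespace still in use: %s" % router["id"])


def process_updated_router(router, do_cleanup):
    # Toy version of the per-router processing step.
    if do_cleanup:
        ensure_snat_cleanup(router)


def run_fullsync(routers, skip_cleanup_for_ha, max_iterations=10):
    """Process the queue; return (failure count, routers still queued).

    max_iterations only exists so that the toy loop terminates.
    """
    queue = deque(routers)
    failures = 0
    iterations = 0
    while queue and iterations < max_iterations:
        iterations += 1
        router = queue.popleft()
        # Hypothetical guard: optionally skip cleanup for HA routers.
        do_cleanup = not (skip_cleanup_for_ha and router["ha"])
        try:
            process_updated_router(router, do_cleanup)
        except Exception:
            # The failing router goes back into the queue and is retried.
            failures += 1
            queue.append(router)
    return failures, len(queue)


if __name__ == "__main__":
    routers = [{"id": "r1", "ha": True}]
    print(run_fullsync(routers, skip_cleanup_for_ha=False, max_iterations=5))
    print(run_fullsync(routers, skip_cleanup_for_ha=True, max_iterations=5))
```

Without the guard, the single HA router fails on every iteration and is never drained from the queue, which matches the endless error logs; with the guard the queue empties on the first pass.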