Comment 8 for bug 1838449

Miguel Lavalle (minsel) wrote :

First, here is what should happen in a successful run of test_from_dvr_to_dvr_ha (see http://paste.openstack.org/show/769794/):

1) Test sets router "admin_state_up": false

2) L3 agents in controller and compute1 receive the router update notification and delete the router locally

In a failed execution, this is the observed sequence of events (see http://paste.openstack.org/show/769795/):

1) Test sets router "admin_state_up": false

2) The L3 agent in compute1 (in this case) receives the router update notification and deletes the router locally

3) The L3 agent in controller receives a router update notification that includes related routers. The agent doesn't delete the router locally; instead it queues the related routers for processing. As a consequence, the router ports are not set to status DOWN and the test case times out
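
To make the difference between the two sequences above easier to see, here is a toy model of them. It is only an illustration of the observed behavior, not neutron code: ToyAgent, on_router_update, ports_down, run and the router ids "r1" and "r2" are all made-up names, and the model assumes the ports only go DOWN once every agent has deleted the router locally.

```python
from dataclasses import dataclass, field


@dataclass
class ToyAgent:
    """Hypothetical stand-in for an L3 agent; not the neutron class."""
    host: str
    routers: set = field(default_factory=set)   # routers hosted locally
    queue: list = field(default_factory=list)   # related routers awaiting processing

    def on_router_update(self, router_id, admin_state_up, related=()):
        if related:
            # Observed failing path: the related routers are queued and
            # the disabled router is NOT deleted locally.
            self.queue.extend(related)
            return
        if not admin_state_up:
            # Expected path: the router is deleted locally.
            self.routers.discard(router_id)


def ports_down(agents, router_id):
    # Assumption of this toy model: ports are DOWN only when no agent
    # still hosts the router.
    return all(router_id not in agent.routers for agent in agents)


def run(controller_related=()):
    controller = ToyAgent("controller", {"r1"})
    compute1 = ToyAgent("compute1", {"r1"})
    # Step 1: the test sets admin_state_up to False for router r1.
    # Step 2: compute1 receives the plain update and deletes the router.
    compute1.on_router_update("r1", admin_state_up=False)
    # Step 2 or 3: the controller receives the update, with or without
    # related routers attached.
    controller.on_router_update("r1", admin_state_up=False,
                                related=controller_related)
    return ports_down([controller, compute1], "r1"), controller.queue
```

Calling run() corresponds to the successful sequence, and run(controller_related=("r2",)) to the failed one, where the controller keeps the router and the ports never reach DOWN. The sketch says nothing about why the notification carries related routers in the first place.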

I have observed this pattern several times.

My main suspect at this point in time is https://github.com/openstack/neutron/blob/78aae12a88e8b3cc0609c830527533b8a8a92d60/neutron/db/l3_dvrscheduler_db.py#L141-L153. This code was last modified in rapid succession by patches https://review.opendev.org/#/c/664525 and https://review.opendev.org/#/c/661522. It was originally created by https://review.opendev.org/#/c/597567