[n-d-r] DRAgent removed from bgp speaker automatically

Bug #2041612 reported by Roberto Bartzen Acosta
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
neutron
New
Medium
Unassigned

Bug Description

Hello everyone,

I noticed a BUG in the Neutron agent reschedule process when I'm running the 'BGP dynamic routing agent'.

I tried to found some n-d-r related documentation about the scheduler process for L3 agents, but I've just saw routers context topics [1].

The main problem is the FSM of the L3 reschedule method, since it was developed for router specifc purposes, and the n-d-r has a different context. To be clear, when this reschedule fails for the DRAgent, an administrative action (bgp_speaker association in a DRAgent) will be automatically removed from the Neutron configuration, and even if the agent comes back, it will no longer be associated with the speaker.

The reschedule_resources_from_down_agents method was develop to "Reschedule routers from down l3 agents if admin state is up", and this same method calls reschedule_resources_from_down_agents [2].

Basically, the problem ocurrs after '4 * agent_down_time' because the reschedule process starts '2 * agent_down_time' without agent communication [3], and try to reschedule until more 2 * agent_down_time [4].

Ofcouse this problem only occurs because the DRAgent has not established communication with Neutron server in the meantime of the reschedule process. The root cause was a RabbitMQ and/or MySQL issue, but it does make sense to me that the speaker is removed from the DRAgent, considering that the link between the BGP speaker and the DRAgent is an administrative action.

Logs about the occurrence of the BUG below:

For reference: agent_down_time = 1200

Oct 20 09:06:15 srv-neutron-server-container-9b119379.1a.ne.jaxyendy.com neutron-server[113]: 2023-10-20 09:06:15.563 113 WARNING neutron.db.agentschedulers_db [req-8e46c4f2-bfeb-4fe7-8726-ea3fd1d76392 - - - - -] Rescheduling bgp_speaker 6addb3dc-7282-4be2-9f40-b234563c1089 from agent 424ad86f-164c-454b-ad64-6f9f7ab2fb72 because the agent did not report to the server in the last 2400 seconds.

Oct 20 09:07:05 srv-neutron-server-container-9b119379.1a.ne.jaxyendy.com neutron-server[113]: 2023-10-20 09:07:05.630 113 ERROR neutron.db.agentschedulers_db [req-8e46c4f2-bfeb-4fe7-8726-ea3fd1d76392 - - - - -] Exception encountered during bgp_speaker rescheduling.: oslo_db.exception.DBDeadlock: (pymysql.err.OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction')#012[SQL: DELETE FROM bgp_speaker_dragent_bindings WHERE bgp_speaker_dragent_bindings.bgp_speaker_id = %(bgp_speaker_id_1)s AND bgp_speaker_dragent_bindings.agent_id = %(agent_id_1)s]#012[parameters: {'bgp_speaker_id_1': '6addb3dc-7282-4be2-9f40-b234563c1089', 'agent_id_1': '424ad86f-164c-454b-ad64-6f9f7ab2fb72'}]

Oct 20 09:16:15 srv-neutron-server-container-9b119379.1a.ne.jaxyendy.com neutron-server[113]: 2023-10-20 09:16:15.563 113 WARNING neutron.db.agentschedulers_db [req-8e46c4f2-bfeb-4fe7-8726-ea3fd1d76392 - - - - -] Rescheduling bgp_speaker 6addb3dc-7282-4be2-9f40-b234563c1089 from agent 424ad86f-164c-454b-ad64-6f9f7ab2fb72 because the agent did not report to the server in the last 2400 seconds.

Oct 20 09:17:05 srv-neutron-server-container-9b119379.1a.ne.jaxyendy.com neutron-server[113]: 2023-10-20 09:17:05.588 113 ERROR neutron.db.agentschedulers_db [req-8e46c4f2-bfeb-4fe7-8726-ea3fd1d76392 - - - - -] Exception encountered during bgp_speaker rescheduling.: oslo_db.exception.DBDeadlock: (pymysql.err.OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction')#012[SQL: DELETE FROM bgp_speaker_dragent_bindings WHERE bgp_speaker_dragent_bindings.bgp_speaker_id = %(bgp_speaker_id_1)s AND bgp_speaker_dragent_bindings.agent_id = %(agent_id_1)s]#012[parameters: {'bgp_speaker_id_1': '6addb3dc-7282-4be2-9f40-b234563c1089', 'agent_id_1': '424ad86f-164c-454b-ad64-6f9f7ab2fb72'}]

Oct 20 09:26:15 srv-neutron-server-container-9b119379.1a.ne.jaxyendy.com neutron-server[113]: 2023-10-20 09:26:15.572 113 WARNING neutron.db.agentschedulers_db [req-8e46c4f2-bfeb-4fe7-8726-ea3fd1d76392 - - - - -] Rescheduling bgp_speaker 6addb3dc-7282-4be2-9f40-b234563c1089 from agent 424ad86f-164c-454b-ad64-6f9f7ab2fb72 because the agent did not report to the server in the last 2400 seconds.

Oct 20 09:27:05 srv-neutron-server-container-9b119379.1a.ne.jaxyendy.com neutron-server[113]: 2023-10-20 09:27:05.591 113 ERROR neutron.db.agentschedulers_db [req-8e46c4f2-bfeb-4fe7-8726-ea3fd1d76392 - - - - -] Exception encountered during bgp_speaker rescheduling.: oslo_db.exception.DBDeadlock: (pymysql.err.OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction')#012[SQL: DELETE FROM bgp_speaker_dragent_bindings WHERE bgp_speaker_dragent_bindings.bgp_speaker_id = %(bgp_speaker_id_1)s AND bgp_speaker_dragent_bindings.agent_id = %(agent_id_1)s]#012[parameters: {'bgp_speaker_id_1': '6addb3dc-7282-4be2-9f40-b234563c1089', 'agent_id_1': '424ad86f-164c-454b-ad64-6f9f7ab2fb72'}]

..........

Oct 20 10:56:15 srv-neutron-server-container-9b119379.1a.ne.jaxyendy.com neutron-server[113]: 2023-10-20 10:56:15.758 113 WARNING neutron.db.agentschedulers_db [req-8e46c4f2-bfeb-4fe7-8726-ea3fd1d76392 - - - - -] Rescheduling bgp_speaker 6addb3dc-7282-4be2-9f40-b234563c1089 from agent 424ad86f-164c-454b-ad64-6f9f7ab2fb72 because the agent did not report to the server in the last 2400 seconds.

Oct 20 10:57:05 srv-neutron-server-container-9b119379.1a.ne.jaxyendy.com neutron-server[113]: 2023-10-20 10:57:05.781 113 ERROR neutron.db.agentschedulers_db [req-8e46c4f2-bfeb-4fe7-8726-ea3fd1d76392 - - - - -] Exception encountered during bgp_speaker rescheduling.: oslo_db.exception.DBDeadlock: (pymysql.err.OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction')#012[SQL: DELETE FROM bgp_speaker_dragent_bindings WHERE bgp_speaker_dragent_bindings.bgp_speaker_id = %(bgp_speaker_id_1)s AND bgp_speaker_dragent_bindings.agent_id = %(agent_id_1)s]#012[parameters: {'bgp_speaker_id_1': '6addb3dc-7282-4be2-9f40-b234563c1089', 'agent_id_1': '424ad86f-164c-454b-ad64-6f9f7ab2fb72'}]

After that, DRAgent was permanently removed from the BGP speaker.

[1] https://docs.openstack.org/neutron/latest/admin/config-services-agent.html#l3-agent
[2] https://opendev.org/openstack/neutron/src/commit/53f4fd6b9fcb4f8ba907bfbace342bf902fc55f7/neutron/db/l3_agentschedulers_db.py#L65
[3] https://opendev.org/openstack/neutron/src/commit/f90d7d2a9e49688f7baee13268c3ba34f59d5a16/neutron/db/agentschedulers_db.py#L123
[4] https://opendev.org/openstack/neutron/src/commit/f90d7d2a9e49688f7baee13268c3ba34f59d5a16/neutron/db/agentschedulers_db.py#L146

description: updated
Miguel Lavalle (minsel)
Changed in neutron:
importance: Undecided → Medium
Revision history for this message
Brian Haley (brian-haley) wrote :

For reference, this might be addressed by this N-D-R bug/rfe:

https://bugs.launchpad.net/neutron/+bug/1920065

Revision history for this message
Roberto Bartzen Acosta (rbartzen) wrote :

Hello folks, thanks for the feedback.

I performed some tests, and the static scheduler config option solves the issue, as I never see the "Rescheduling bgp_speaker" message again.

BTW would it make sense to set the static scheduler as the default config option? since the operator probably has more than one DRAgent/speaker in HA, it would not be a good idea to remove and add the scheduler automatically. I mean, there is no way to make this work using other type of schedulers since the operator can set a DRAgent to a specific speaker, and a dynamic scheduler type can generate inconsistencies.

I can propose a patch to make this option as the default, and IMO the other scheduler types could be deprecated in the future, as suggested by fricker at the Neutron meeting today.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.