Activity log for bug #2041612

Date Who What changed Old value New value Message
2023-10-27 11:19:59 Roberto Bartzen Acosta bug added bug
2023-10-27 11:20:41 Roberto Bartzen Acosta description Hello everyone, I noticed a BUG in the Neutron agent reschedule process when I'm running the 'BGP dynamic routing agent'. I tried to found some n-d-r related documentation about the scheduler process for L3 agents, but I've just saw routers context topics [1]. The main problem is the FSM of the L3 reschedule method, since it was developed for router specifc purposes, and the n-d-r has a different context. To be clear, when this reschedule fails for the DRAgent, an administrative action (bgp_speaker association in a DRAgent) will be automatically removed from the Neutron configuration, and even if the agent comes back, it will no longer be associated with the speaker. The reschedule_resources_from_down_agents method was develop to "Reschedule routers from down l3 agents if admin state is up", and this same method calls reschedule_resources_from_down_agents [2]. Basically, the problem ocurrs after '4 * agent_down_time' because the reschedule process starts '2 * agent_down_time' without agent communication [3], and try to reschedule until more 2 * agent_down_time [4]. Ofcouse this problem only occurs because the DRAgent has not established communication with Neutron server in the meantime of the reschedule process. The root cause was a RabbitMQ and/or MySQL issue, but it does make sense to me that the speaker is removed from the DRAgent, considering that the link between the BGP speaker and the DRAgent is an administrative action. Logs about the occurrence of the BUG below: For reference: agent_down_time = 1200 Oct 20 09:06:15 srv-neutron-server-container-9b119379.1a.ne.jaxyendy.com neutron-server[113]: 2023-10-20 09:06:15.563 113 WARNING neutron.db.agentschedulers_db [req-8e46c4f2-bfeb-4fe7-8726-ea3fd1d76392 - - - - -] Rescheduling bgp_speaker 6addb3dc-7282-4be2-9f40-b234563c1089 from agent 424ad86f-164c-454b-ad64-6f9f7ab2fb72 because the agent did not report to the server in the last 2400 seconds. Oct 20 09:07:05 srv-neutron-server-container-9b119379.1a.ne.jaxyendy.com neutron-server[113]: 2023-10-20 09:07:05.630 113 ERROR neutron.db.agentschedulers_db [req-8e46c4f2-bfeb-4fe7-8726-ea3fd1d76392 - - - - -] Exception encountered during bgp_speaker rescheduling.: oslo_db.exception.DBDeadlock: (pymysql.err.OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction')#012[SQL: DELETE FROM bgp_speaker_dragent_bindings WHERE bgp_speaker_dragent_bindings.bgp_speaker_id = %(bgp_speaker_id_1)s AND bgp_speaker_dragent_bindings.agent_id = %(agent_id_1)s]#012[parameters: {'bgp_speaker_id_1': '6addb3dc-7282-4be2-9f40-b234563c1089', 'agent_id_1': '424ad86f-164c-454b-ad64-6f9f7ab2fb72'}] Oct 20 09:16:15 srv-neutron-server-container-9b119379.1a.ne.jaxyendy.com neutron-server[113]: 2023-10-20 09:16:15.563 113 WARNING neutron.db.agentschedulers_db [req-8e46c4f2-bfeb-4fe7-8726-ea3fd1d76392 - - - - -] Rescheduling bgp_speaker 6addb3dc-7282-4be2-9f40-b234563c1089 from agent 424ad86f-164c-454b-ad64-6f9f7ab2fb72 because the agent did not report to the server in the last 2400 seconds. Oct 20 09:17:05 srv-neutron-server-container-9b119379.1a.ne.jaxyendy.com neutron-server[113]: 2023-10-20 09:17:05.588 113 ERROR neutron.db.agentschedulers_db [req-8e46c4f2-bfeb-4fe7-8726-ea3fd1d76392 - - - - -] Exception encountered during bgp_speaker rescheduling.: oslo_db.exception.DBDeadlock: (pymysql.err.OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction')#012[SQL: DELETE FROM bgp_speaker_dragent_bindings WHERE bgp_speaker_dragent_bindings.bgp_speaker_id = %(bgp_speaker_id_1)s AND bgp_speaker_dragent_bindings.agent_id = %(agent_id_1)s]#012[parameters: {'bgp_speaker_id_1': '6addb3dc-7282-4be2-9f40-b234563c1089', 'agent_id_1': '424ad86f-164c-454b-ad64-6f9f7ab2fb72'}] Oct 20 09:26:15 srv-neutron-server-container-9b119379.1a.ne.jaxyendy.com neutron-server[113]: 2023-10-20 09:26:15.572 113 WARNING neutron.db.agentschedulers_db [req-8e46c4f2-bfeb-4fe7-8726-ea3fd1d76392 - - - - -] Rescheduling bgp_speaker 6addb3dc-7282-4be2-9f40-b234563c1089 from agent 424ad86f-164c-454b-ad64-6f9f7ab2fb72 because the agent did not report to the server in the last 2400 seconds. Oct 20 09:27:05 srv-neutron-server-container-9b119379.1a.ne.jaxyendy.com neutron-server[113]: 2023-10-20 09:27:05.591 113 ERROR neutron.db.agentschedulers_db [req-8e46c4f2-bfeb-4fe7-8726-ea3fd1d76392 - - - - -] Exception encountered during bgp_speaker rescheduling.: oslo_db.exception.DBDeadlock: (pymysql.err.OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction')#012[SQL: DELETE FROM bgp_speaker_dragent_bindings WHERE bgp_speaker_dragent_bindings.bgp_speaker_id = %(bgp_speaker_id_1)s AND bgp_speaker_dragent_bindings.agent_id = %(agent_id_1)s]#012[parameters: {'bgp_speaker_id_1': '6addb3dc-7282-4be2-9f40-b234563c1089', 'agent_id_1': '424ad86f-164c-454b-ad64-6f9f7ab2fb72'}] .......... Oct 20 10:56:15 srv-neutron-server-container-9b119379.1a.ne.jaxyendy.com neutron-server[113]: 2023-10-20 10:56:15.758 113 WARNING neutron.db.agentschedulers_db [req-8e46c4f2-bfeb-4fe7-8726-ea3fd1d76392 - - - - -] Rescheduling bgp_speaker 6addb3dc-7282-4be2-9f40-b234563c1089 from agent 424ad86f-164c-454b-ad64-6f9f7ab2fb72 because the agent did not report to the server in the last 2400 seconds. Oct 20 10:57:05 srv-neutron-server-container-9b119379.1a.ne.jaxyendy.com neutron-server[113]: 2023-10-20 10:57:05.781 113 ERROR neutron.db.agentschedulers_db [req-8e46c4f2-bfeb-4fe7-8726-ea3fd1d76392 - - - - -] Exception encountered during bgp_speaker rescheduling.: oslo_db.exception.DBDeadlock: (pymysql.err.OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction')#012[SQL: DELETE FROM bgp_speaker_dragent_bindings WHERE bgp_speaker_dragent_bindings.bgp_speaker_id = %(bgp_speaker_id_1)s AND bgp_speaker_dragent_bindings.agent_id = %(agent_id_1)s]#012[parameters: {'bgp_speaker_id_1': '6addb3dc-7282-4be2-9f40-b234563c1089', 'agent_id_1': '424ad86f-164c-454b-ad64-6f9f7ab2fb72'}] After that, DRAgent was permanently removed from the BGP speaker. Hello everyone, I noticed a BUG in the Neutron agent reschedule process when I'm running the 'BGP dynamic routing agent'. I tried to found some n-d-r related documentation about the scheduler process for L3 agents, but I've just saw routers context topics [1]. The main problem is the FSM of the L3 reschedule method, since it was developed for router specifc purposes, and the n-d-r has a different context. To be clear, when this reschedule fails for the DRAgent, an administrative action (bgp_speaker association in a DRAgent) will be automatically removed from the Neutron configuration, and even if the agent comes back, it will no longer be associated with the speaker. The reschedule_resources_from_down_agents method was develop to "Reschedule routers from down l3 agents if admin state is up", and this same method calls reschedule_resources_from_down_agents [2]. Basically, the problem ocurrs after '4 * agent_down_time' because the reschedule process starts '2 * agent_down_time' without agent communication [3], and try to reschedule until more 2 * agent_down_time [4]. Ofcouse this problem only occurs because the DRAgent has not established communication with Neutron server in the meantime of the reschedule process. The root cause was a RabbitMQ and/or MySQL issue, but it does make sense to me that the speaker is removed from the DRAgent, considering that the link between the BGP speaker and the DRAgent is an administrative action. Logs about the occurrence of the BUG below: For reference: agent_down_time = 1200 Oct 20 09:06:15 srv-neutron-server-container-9b119379.1a.ne.jaxyendy.com neutron-server[113]: 2023-10-20 09:06:15.563 113 WARNING neutron.db.agentschedulers_db [req-8e46c4f2-bfeb-4fe7-8726-ea3fd1d76392 - - - - -] Rescheduling bgp_speaker 6addb3dc-7282-4be2-9f40-b234563c1089 from agent 424ad86f-164c-454b-ad64-6f9f7ab2fb72 because the agent did not report to the server in the last 2400 seconds. Oct 20 09:07:05 srv-neutron-server-container-9b119379.1a.ne.jaxyendy.com neutron-server[113]: 2023-10-20 09:07:05.630 113 ERROR neutron.db.agentschedulers_db [req-8e46c4f2-bfeb-4fe7-8726-ea3fd1d76392 - - - - -] Exception encountered during bgp_speaker rescheduling.: oslo_db.exception.DBDeadlock: (pymysql.err.OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction')#012[SQL: DELETE FROM bgp_speaker_dragent_bindings WHERE bgp_speaker_dragent_bindings.bgp_speaker_id = %(bgp_speaker_id_1)s AND bgp_speaker_dragent_bindings.agent_id = %(agent_id_1)s]#012[parameters: {'bgp_speaker_id_1': '6addb3dc-7282-4be2-9f40-b234563c1089', 'agent_id_1': '424ad86f-164c-454b-ad64-6f9f7ab2fb72'}] Oct 20 09:16:15 srv-neutron-server-container-9b119379.1a.ne.jaxyendy.com neutron-server[113]: 2023-10-20 09:16:15.563 113 WARNING neutron.db.agentschedulers_db [req-8e46c4f2-bfeb-4fe7-8726-ea3fd1d76392 - - - - -] Rescheduling bgp_speaker 6addb3dc-7282-4be2-9f40-b234563c1089 from agent 424ad86f-164c-454b-ad64-6f9f7ab2fb72 because the agent did not report to the server in the last 2400 seconds. Oct 20 09:17:05 srv-neutron-server-container-9b119379.1a.ne.jaxyendy.com neutron-server[113]: 2023-10-20 09:17:05.588 113 ERROR neutron.db.agentschedulers_db [req-8e46c4f2-bfeb-4fe7-8726-ea3fd1d76392 - - - - -] Exception encountered during bgp_speaker rescheduling.: oslo_db.exception.DBDeadlock: (pymysql.err.OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction')#012[SQL: DELETE FROM bgp_speaker_dragent_bindings WHERE bgp_speaker_dragent_bindings.bgp_speaker_id = %(bgp_speaker_id_1)s AND bgp_speaker_dragent_bindings.agent_id = %(agent_id_1)s]#012[parameters: {'bgp_speaker_id_1': '6addb3dc-7282-4be2-9f40-b234563c1089', 'agent_id_1': '424ad86f-164c-454b-ad64-6f9f7ab2fb72'}] Oct 20 09:26:15 srv-neutron-server-container-9b119379.1a.ne.jaxyendy.com neutron-server[113]: 2023-10-20 09:26:15.572 113 WARNING neutron.db.agentschedulers_db [req-8e46c4f2-bfeb-4fe7-8726-ea3fd1d76392 - - - - -] Rescheduling bgp_speaker 6addb3dc-7282-4be2-9f40-b234563c1089 from agent 424ad86f-164c-454b-ad64-6f9f7ab2fb72 because the agent did not report to the server in the last 2400 seconds. Oct 20 09:27:05 srv-neutron-server-container-9b119379.1a.ne.jaxyendy.com neutron-server[113]: 2023-10-20 09:27:05.591 113 ERROR neutron.db.agentschedulers_db [req-8e46c4f2-bfeb-4fe7-8726-ea3fd1d76392 - - - - -] Exception encountered during bgp_speaker rescheduling.: oslo_db.exception.DBDeadlock: (pymysql.err.OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction')#012[SQL: DELETE FROM bgp_speaker_dragent_bindings WHERE bgp_speaker_dragent_bindings.bgp_speaker_id = %(bgp_speaker_id_1)s AND bgp_speaker_dragent_bindings.agent_id = %(agent_id_1)s]#012[parameters: {'bgp_speaker_id_1': '6addb3dc-7282-4be2-9f40-b234563c1089', 'agent_id_1': '424ad86f-164c-454b-ad64-6f9f7ab2fb72'}] .......... Oct 20 10:56:15 srv-neutron-server-container-9b119379.1a.ne.jaxyendy.com neutron-server[113]: 2023-10-20 10:56:15.758 113 WARNING neutron.db.agentschedulers_db [req-8e46c4f2-bfeb-4fe7-8726-ea3fd1d76392 - - - - -] Rescheduling bgp_speaker 6addb3dc-7282-4be2-9f40-b234563c1089 from agent 424ad86f-164c-454b-ad64-6f9f7ab2fb72 because the agent did not report to the server in the last 2400 seconds. Oct 20 10:57:05 srv-neutron-server-container-9b119379.1a.ne.jaxyendy.com neutron-server[113]: 2023-10-20 10:57:05.781 113 ERROR neutron.db.agentschedulers_db [req-8e46c4f2-bfeb-4fe7-8726-ea3fd1d76392 - - - - -] Exception encountered during bgp_speaker rescheduling.: oslo_db.exception.DBDeadlock: (pymysql.err.OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction')#012[SQL: DELETE FROM bgp_speaker_dragent_bindings WHERE bgp_speaker_dragent_bindings.bgp_speaker_id = %(bgp_speaker_id_1)s AND bgp_speaker_dragent_bindings.agent_id = %(agent_id_1)s]#012[parameters: {'bgp_speaker_id_1': '6addb3dc-7282-4be2-9f40-b234563c1089', 'agent_id_1': '424ad86f-164c-454b-ad64-6f9f7ab2fb72'}] After that, DRAgent was permanently removed from the BGP speaker. [1] https://docs.openstack.org/neutron/latest/admin/config-services-agent.html#l3-agent [2] https://opendev.org/openstack/neutron/src/commit/53f4fd6b9fcb4f8ba907bfbace342bf902fc55f7/neutron/db/l3_agentschedulers_db.py#L65 [3] https://opendev.org/openstack/neutron/src/commit/f90d7d2a9e49688f7baee13268c3ba34f59d5a16/neutron/db/agentschedulers_db.py#L123 [4] https://opendev.org/openstack/neutron/src/commit/f90d7d2a9e49688f7baee13268c3ba34f59d5a16/neutron/db/agentschedulers_db.py#L146
2023-10-29 21:30:30 Miguel Lavalle neutron: importance Undecided Medium