Activity log for bug #2033894

Date Who What changed Old value New value Message
2023-09-01 12:36:23 Gregory Thiemonge bug added bug
2023-09-01 12:39:39 Gregory Thiemonge description

    Old value:

    There is an issue in amphorav1 when 2 amps of an A/S LB are missing and a failover is triggered, the failover can take between 15 and 40 min, depending on the retry/interval config in the worker.

    when recreating the 1st amphora, the failover flow also updates the 2nd amphora (it updates the VRRP config in all the amphorae):

    https://opendev.org/openstack/octavia/src/commit/8b196bb3bb05c9d87dae688750457dbb944d4d1b/octavia/controller/worker/v1/flows/amphora_flows.py#L290-L308

    but if the 2nd amphora is also failing, those tasks skip the update after "timeout_dict" (but they are not failing, and they set the status of the amp to ERROR)

    however, AmphoraIndexVRRPStart has a few issues:

    - it calls amphora_driver.start_vrrp_service, this function skips non-ALLOCATED amp [1], but the task doesn't reload the DB object, if a previous task sets it in ERROR, it is still ALLOCATED in the object passed by taskflow (note: in v2, the object is reloaded from the DB)
    - it passes timeout_dict to amphora_driver.start_vrrp_service but start_vrrp_service doesn't pass it to _populate_amphora_api_version [2], in case the amp is missing, the shorted timeout in timeout_dict is not used (the function timeouts after ~15 min in devstack)

    [1] https://opendev.org/openstack/octavia/src/commit/8b196bb3bb05c9d87dae688750457dbb944d4d1b/octavia/amphorae/drivers/keepalived/vrrp_rest_driver.py#L84
    [2] https://opendev.org/openstack/octavia/src/commit/8b196bb3bb05c9d87dae688750457dbb944d4d1b/octavia/amphorae/drivers/keepalived/vrrp_rest_driver.py#L91

    We can fix both issues but there's a more interesting lead

    the failover flow invokes 3 successive tasks:

    - AmphoraIndexUpdateVRRPInterface
    - AmphoraIndexVRRPUpdate
    - AmphoraIndexVRRPStart

    AmphoraIndexUpdateVRRPInterface provides an 'amp_vrrp_int' that is not null when it succeeds. when the amphora is not reachable, amp_vrrp_int is None [3]

    This value can be passed to AmphoraIndexVRRPUpdate and AmphoraIndexVRRPStart, and we can skip the execution of the functions if it is None, it would prevent the amphora driver from attempting to connect to a missing amphora.

    [3] https://opendev.org/openstack/octavia/src/commit/8b196bb3bb05c9d87dae688750457dbb944d4d1b/octavia/controller/worker/v1/tasks/amphora_driver_tasks.py#L321

    New value:

    There is an issue in amphorav1 when 2 amps of an A/S LB are missing and a failover is triggered, the failover can take between 15 and 40 min, depending on the retry/interval config in the worker.

    when recreating the 1st amphora, the failover flow also updates the 2nd amphora (it updates the VRRP config in all the amphorae):

    https://opendev.org/openstack/octavia/src/commit/8b196bb3bb05c9d87dae688750457dbb944d4d1b/octavia/controller/worker/v1/flows/amphora_flows.py#L290-L308

    but if the 2nd amphora is also failing, those tasks skip the update after "timeout_dict" (but they are not failing, and they set the status of the amp to ERROR)

    however, AmphoraIndexVRRPStart has a few issues:

    - it calls amphora_driver.start_vrrp_service, this function skips non-ALLOCATED amp [1], but the task doesn't reload the DB object, if a previous task sets it in ERROR, it is still ALLOCATED in the object passed by taskflow (note: in v2, the object is reloaded from the DB)
    - it passes timeout_dict to amphora_driver.start_vrrp_service but start_vrrp_service doesn't pass it to _populate_amphora_api_version [2], in case the amp is missing, the short timeout in timeout_dict is not used (the function timeouts after ~15 min in devstack instead of 2 or 3 min)

    [1] https://opendev.org/openstack/octavia/src/commit/8b196bb3bb05c9d87dae688750457dbb944d4d1b/octavia/amphorae/drivers/keepalived/vrrp_rest_driver.py#L84
    [2] https://opendev.org/openstack/octavia/src/commit/8b196bb3bb05c9d87dae688750457dbb944d4d1b/octavia/amphorae/drivers/keepalived/vrrp_rest_driver.py#L91

    We can fix both issues but there's a more interesting lead

    the failover flow invokes 3 successive tasks:

    - AmphoraIndexUpdateVRRPInterface
    - AmphoraIndexVRRPUpdate
    - AmphoraIndexVRRPStart

    AmphoraIndexUpdateVRRPInterface provides an 'amp_vrrp_int' that is not null when it succeeds. when the amphora is not reachable, amp_vrrp_int is None [3]

    This value can be passed to AmphoraIndexVRRPUpdate and AmphoraIndexVRRPStart, and we can skip the execution of the functions if it is None, it would prevent the amphora driver from attempting to connect to a missing amphora.

    [3] https://opendev.org/openstack/octavia/src/commit/8b196bb3bb05c9d87dae688750457dbb944d4d1b/octavia/controller/worker/v1/tasks/amphora_driver_tasks.py#L321
2023-09-01 13:27:07 Gregory Thiemonge octavia: importance Undecided Medium
2023-09-01 13:27:07 Gregory Thiemonge octavia: status New Confirmed
2023-09-01 13:27:07 Gregory Thiemonge octavia: assignee Gregory Thiemonge (gthiemonge)
2023-09-01 15:37:26 OpenStack Infra octavia: status Confirmed In Progress
2023-09-06 20:55:03 Thobias Trevisan bug added subscriber Thobias Trevisan
2023-09-06 21:08:59 Felipe Alencastro bug added subscriber Felipe Alencastro
2023-10-09 15:35:30 OpenStack Infra octavia: status In Progress Fix Released
2023-10-16 09:16:48 OpenStack Infra tags in-stable-zed
2023-10-16 09:16:57 OpenStack Infra tags in-stable-zed in-stable-yoga in-stable-zed
2023-10-16 09:17:05 OpenStack Infra tags in-stable-yoga in-stable-zed in-stable-xena in-stable-yoga in-stable-zed
2023-10-16 09:17:14 OpenStack Infra tags in-stable-xena in-stable-yoga in-stable-zed in-stable-wallaby in-stable-xena in-stable-yoga in-stable-zed
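
Illustration of the lead proposed in the description (skip the VRRP update and start steps when amp_vrrp_int is None). This is a minimal sketch, not the Octavia code and not the merged fix: it does not use taskflow, and Amphora, RecordingDriver, get_interface, update_vrrp_conf and the three helper functions are hypothetical names chosen for the example. Only the idea comes from the description: the first step yields None for an unreachable amphora, and the two following steps return early instead of contacting it.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Amphora:
    """Hypothetical amphora record; 'reachable' simulates a missing VM."""
    id: str
    reachable: bool = True


@dataclass
class RecordingDriver:
    """Hypothetical stand-in for an amphora driver; records every call."""
    calls: list = field(default_factory=list)

    def get_interface(self, amp: Amphora) -> str:
        self.calls.append(("get_interface", amp.id))
        if not amp.reachable:
            raise TimeoutError(f"amphora {amp.id} is unreachable")
        return "eth1"

    def update_vrrp_conf(self, amp: Amphora, amp_vrrp_int: str) -> None:
        self.calls.append(("update_vrrp_conf", amp.id))

    def start_vrrp_service(self, amp: Amphora) -> None:
        self.calls.append(("start_vrrp_service", amp.id))


def update_vrrp_interface(driver: RecordingDriver, amp: Amphora) -> Optional[str]:
    """First step: return the VRRP interface, or None if the amp is unreachable."""
    try:
        return driver.get_interface(amp)
    except TimeoutError:
        return None


def vrrp_update(driver: RecordingDriver, amp: Amphora,
                amp_vrrp_int: Optional[str]) -> bool:
    """Second step: skipped (returns False) when the first step produced None."""
    if amp_vrrp_int is None:
        return False
    driver.update_vrrp_conf(amp, amp_vrrp_int)
    return True


def vrrp_start(driver: RecordingDriver, amp: Amphora,
               amp_vrrp_int: Optional[str]) -> bool:
    """Third step: skipped (returns False) when the first step produced None."""
    if amp_vrrp_int is None:
        return False
    driver.start_vrrp_service(amp)
    return True


def refresh_vrrp(driver: RecordingDriver, amp: Amphora) -> Optional[str]:
    """Run the three steps in order, passing amp_vrrp_int along."""
    amp_vrrp_int = update_vrrp_interface(driver, amp)
    vrrp_update(driver, amp, amp_vrrp_int)
    vrrp_start(driver, amp, amp_vrrp_int)
    return amp_vrrp_int
```

With this shape, an unreachable amphora is contacted once (by the first step) and the later steps make no driver call, so their connection timeouts are never reached.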