2023-09-01 12:39:39 |
Gregory Thiemonge |
description |
There is an issue in amphorav1: when both amps of an Active/Standby (A/S) LB are missing and a failover is triggered, the failover can take between 15 and 40 min, depending on the retry/interval config in the worker.
When recreating the 1st amphora, the failover flow also updates the 2nd amphora (it updates the VRRP config in all the amphorae):
https://opendev.org/openstack/octavia/src/commit/8b196bb3bb05c9d87dae688750457dbb944d4d1b/octavia/controller/worker/v1/flows/amphora_flows.py#L290-L308
If the 2nd amphora is also failing, those tasks skip the update once the timeout defined in "timeout_dict" expires (the tasks do not fail; they set the status of the amp to ERROR).
However, AmphoraIndexVRRPStart has a few issues:
- It calls amphora_driver.start_vrrp_service, which skips non-ALLOCATED amps [1], but the task doesn't reload the DB object: if a previous task set the amp to ERROR, it is still ALLOCATED in the object passed by taskflow (note: in v2, the object is reloaded from the DB).
- It passes timeout_dict to amphora_driver.start_vrrp_service, but start_vrrp_service doesn't pass it to _populate_amphora_api_version [2]. When the amp is missing, the shorter timeout in timeout_dict is not used (the function times out after ~15 min in devstack).
[1] https://opendev.org/openstack/octavia/src/commit/8b196bb3bb05c9d87dae688750457dbb944d4d1b/octavia/amphorae/drivers/keepalived/vrrp_rest_driver.py#L84
[2] https://opendev.org/openstack/octavia/src/commit/8b196bb3bb05c9d87dae688750457dbb944d4d1b/octavia/amphorae/drivers/keepalived/vrrp_rest_driver.py#L91
We can fix both issues, but there's a more interesting lead.
The failover flow invokes 3 successive tasks:
- AmphoraIndexUpdateVRRPInterface
- AmphoraIndexVRRPUpdate
- AmphoraIndexVRRPStart
AmphoraIndexUpdateVRRPInterface provides an 'amp_vrrp_int' that is not None when it succeeds.
When the amphora is not reachable, amp_vrrp_int is None [3].
This value can be passed to AmphoraIndexVRRPUpdate and AmphoraIndexVRRPStart, which can skip their work when it is None. That would prevent the amphora driver from attempting to connect to a missing amphora.
[3] https://opendev.org/openstack/octavia/src/commit/8b196bb3bb05c9d87dae688750457dbb944d4d1b/octavia/controller/worker/v1/tasks/amphora_driver_tasks.py#L321 |
There is an issue in amphorav1: when both amps of an Active/Standby (A/S) LB are missing and a failover is triggered, the failover can take between 15 and 40 min, depending on the retry/interval config in the worker.
When recreating the 1st amphora, the failover flow also updates the 2nd amphora (it updates the VRRP config in all the amphorae):
https://opendev.org/openstack/octavia/src/commit/8b196bb3bb05c9d87dae688750457dbb944d4d1b/octavia/controller/worker/v1/flows/amphora_flows.py#L290-L308
If the 2nd amphora is also failing, those tasks skip the update once the timeout defined in "timeout_dict" expires (the tasks do not fail; they set the status of the amp to ERROR).
However, AmphoraIndexVRRPStart has a few issues:
- It calls amphora_driver.start_vrrp_service, which skips non-ALLOCATED amps [1], but the task doesn't reload the DB object: if a previous task set the amp to ERROR, it is still ALLOCATED in the object passed by taskflow (note: in v2, the object is reloaded from the DB).
- It passes timeout_dict to amphora_driver.start_vrrp_service, but start_vrrp_service doesn't pass it to _populate_amphora_api_version [2]. When the amp is missing, the short timeout in timeout_dict is not used (the function times out after ~15 min in devstack instead of 2 or 3 min).
[1] https://opendev.org/openstack/octavia/src/commit/8b196bb3bb05c9d87dae688750457dbb944d4d1b/octavia/amphorae/drivers/keepalived/vrrp_rest_driver.py#L84
[2] https://opendev.org/openstack/octavia/src/commit/8b196bb3bb05c9d87dae688750457dbb944d4d1b/octavia/amphorae/drivers/keepalived/vrrp_rest_driver.py#L91
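The two fixes described above could look roughly like the following. This is a minimal sketch with hypothetical stand-ins (FakeAmphoraRepo, FakeDriver, vrrp_start_task, populate_api_version); it does not reproduce the real Octavia classes or signatures, only the two ideas: re-read the amphora status before deciding to skip, and forward timeout_dict to the API version lookup.

```python
# Hypothetical stand-ins for illustration; not the real Octavia code.
ALLOCATED = "ALLOCATED"
ERROR = "ERROR"


class FakeAmphoraRepo:
    """Stand-in for a DB repository holding the current amphora status."""

    def __init__(self, statuses):
        self._statuses = dict(statuses)

    def get_status(self, amphora_id):
        return self._statuses[amphora_id]


class FakeDriver:
    """Stand-in driver that records each call and the timeout it received."""

    def __init__(self):
        self.calls = []

    def populate_api_version(self, amphora_id, timeout_dict=None):
        self.calls.append(("api_version", amphora_id, timeout_dict))

    def start_vrrp_service(self, amphora_id, status, timeout_dict=None):
        if status != ALLOCATED:
            return False  # skip amphorae that are not ALLOCATED
        # Forward timeout_dict so the short failover timeout applies here too.
        self.populate_api_version(amphora_id, timeout_dict=timeout_dict)
        self.calls.append(("start_vrrp", amphora_id, timeout_dict))
        return True


def vrrp_start_task(stale_amphora, repo, driver, timeout_dict):
    """Use the status stored in the DB, not the one in the stale object."""
    fresh_status = repo.get_status(stale_amphora["id"])
    return driver.start_vrrp_service(
        stale_amphora["id"], fresh_status, timeout_dict=timeout_dict)
```

With an amphora that a previous task has set to ERROR in the repository, vrrp_start_task returns False without any driver call, even if the object handed to the task still says ALLOCATED.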
We can fix both issues, but there's a more interesting lead.
The failover flow invokes 3 successive tasks:
- AmphoraIndexUpdateVRRPInterface
- AmphoraIndexVRRPUpdate
- AmphoraIndexVRRPStart
AmphoraIndexUpdateVRRPInterface provides an 'amp_vrrp_int' that is not None when it succeeds.
When the amphora is not reachable, amp_vrrp_int is None [3].
This value can be passed to AmphoraIndexVRRPUpdate and AmphoraIndexVRRPStart, which can skip their work when it is None. That would prevent the amphora driver from attempting to connect to a missing amphora.
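As a rough illustration of that idea, the sketch below chains three plain functions standing in for the three tasks. Everything here is hypothetical (the function names, the "eth1" interface name, the recorded call tuples); it only shows how a None amp_vrrp_int from the first step lets the next two steps return early without contacting the amphora.

```python
# Hypothetical sketch; plain functions stand in for the taskflow tasks.

def update_vrrp_interface(amphora_id, reachable):
    """Return the VRRP interface name, or None if the amphora is unreachable."""
    return "eth1" if reachable else None


def vrrp_update(amphora_id, amp_vrrp_int, driver_calls):
    if amp_vrrp_int is None:
        return  # the earlier step could not reach the amphora; skip
    driver_calls.append(("update_vrrp_conf", amphora_id, amp_vrrp_int))


def vrrp_start(amphora_id, amp_vrrp_int, driver_calls):
    if amp_vrrp_int is None:
        return  # same guard: no connection attempt to a missing amphora
    driver_calls.append(("start_vrrp_service", amphora_id))


def run_vrrp_steps(amphora_id, reachable):
    """Run the three steps in order and return the driver calls that were made."""
    calls = []
    amp_vrrp_int = update_vrrp_interface(amphora_id, reachable)
    vrrp_update(amphora_id, amp_vrrp_int, calls)
    vrrp_start(amphora_id, amp_vrrp_int, calls)
    return calls
```

For an unreachable amphora, run_vrrp_steps makes no driver calls at all, so only the first step pays the connection timeout.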
[3] https://opendev.org/openstack/octavia/src/commit/8b196bb3bb05c9d87dae688750457dbb944d4d1b/octavia/controller/worker/v1/tasks/amphora_driver_tasks.py#L321 |
|