OVN maintenance tasks may be delayed 10 minutes in the podified deployment
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
neutron | Fix Released | Medium | Slawek Kaplonski |
Bug Description
When running the Neutron server on a K8s (or OpenShift) cluster, it may happen that OVN maintenance periodic tasks which are supposed to run immediately are delayed by about 10 minutes. This happens, for example, when Neutron's configuration is changed and K8s restarts the Neutron pods. What happens in such a case is:
1. pods with the neutron-api application are running,
2. the configuration is updated and k8s first starts new pods and, once the new ones are ready, terminates the old pods,
3. during that time, the neutron-server process which runs in the new pod starts the maintenance worker, which immediately tries to run the tasks defined with the "periodics" decorator,
4. this new pod doesn't yet have the lock on the OVN northbound DB, so each of these maintenance tasks stops immediately,
5. a few seconds later the OLD neutron-server pod is terminated by k8s and the new pod (the one described in point 3) gets the lock on the OVN database,
6. now all maintenance tasks are run again by the maintenance worker after the time defined in the "spacing" parameter, which is 600 seconds. That is a pretty long time to wait for e.g. some options in the OVN database to be adjusted to the new Neutron configuration.
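For illustration, here is a minimal sketch (not Neutron's actual code) of the pattern described above, built on the futurist library's periodics API that the maintenance worker uses. With run_immediately=True the task fires once at startup, returns early because the lock is not held, and is then only retried after the 600 second spacing; the has_lock callable here is a hypothetical stand-in for the real OVN NB lock check.

```python
from futurist import periodics


class MaintenanceSketch(object):
    def __init__(self, has_lock):
        # has_lock is a hypothetical callable standing in for the
        # check of the OVN northbound DB lock.
        self.has_lock = has_lock

    @periodics.periodic(spacing=600, run_immediately=True)
    def sync_ovn_config(self):
        if not self.has_lock():
            # No lock yet (the old pod still holds it): bail out.
            # The task will not be retried for another 600 seconds.
            return
        # ... the actual maintenance work would happen here ...
```

Such an object would be driven by something like periodics.PeriodicWorker.create([MaintenanceSketch(...)]) followed by start() on the worker.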
We could reduce this spacing time to e.g. 5 seconds. That would decrease this additional waiting time significantly in the case described in this bug. It would also make all those methods be called much more often in neutron-server processes which don't have the lock granted, but we could introduce an additional parameter for that and e.g. raise NeverAgain() after 100 attempts to run such a periodic task.
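A hedged sketch of what that could look like, again using the futurist periodics API. The 5 second spacing and the 100-attempt limit are just the example values mentioned above, and MAX_ATTEMPTS is a hypothetical new parameter, not an existing option:

```python
from futurist import periodics

MAX_ATTEMPTS = 100  # hypothetical new parameter


class ProposedSketch(object):
    def __init__(self, has_lock):
        self.has_lock = has_lock  # hypothetical lock-state callable
        self._attempts = 0

    @periodics.periodic(spacing=5, run_immediately=True)
    def sync_ovn_config(self):
        self._attempts += 1
        if self._attempts > MAX_ATTEMPTS:
            # Stop rescheduling this task in processes that never
            # managed to get the lock.
            raise periodics.NeverAgain()
        if not self.has_lock():
            return
        # ... maintenance work; one-shot tasks would also raise
        # NeverAgain() here once they have completed successfully ...
```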
This is just food for thought, and perhaps we can also discuss it on the mailing list or at the next PTG, but maybe we should consider a way to decouple the maintenance and periodic tasks from Neutron in podified environments. There is always only one maintenance process active in the cluster: the one that holds the lock. In a podified environment we could have just one pod running the tasks, avoiding locking and relying on the underlying k8s functionality to take care of the pod lifecycle, meaning that there would always be one healthy pod in the cluster executing the periodic/maintenance routines.
That way we would also solve the problem described in this bug, plus perhaps other potential issues that may arise because of the podified nature of the deployment.
What do you think?