StarlingX

Bug #1900920
Comment #2

Douglas Henrique Koerich (dkoerich-wr) wrote on 2021-03-11:

Following the steps in the bug description above, the issue was reproduced in an AIO-DX environment. The timeline below was observed on the host where the pod(s) were scheduled:

t=0s: Finished controller manifest
t=8s: Started worker manifest
t=37s: Start of k8s-pod-recovery
t=38s: Finished worker manifest
t=63s: Started creation of "restart-on-reboot" labeled pod(s)
t=281s: Same labeled pod(s) verified, without having been restarted

The pod(s) are not restarted because k8s-pod-recovery queries for the labeled pods to recover when it is launched (t=37s), before those pods start being created (t=63s), so the query returns an empty set.
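To make the ordering problem concrete, here is a minimal Python model of the first timeline. It is only an illustration and not the k8s-pod-recovery code: the function names and the pod name are hypothetical, and the only real data are the timestamps taken from the timeline above.

```python
# Hypothetical model of the first timeline; times are in seconds.
RECOVERY_STARTED_AT = 37  # k8s-pod-recovery is launched
POD_CREATED_AT = 63       # labeled pod(s) start being created


def labeled_pods_visible(query_time, created_at=POD_CREATED_AT):
    """Return the labeled pods that a query would see at query_time."""
    # The pod name is a placeholder, not a real pod from the test system.
    return ["restart-on-reboot-pod"] if query_time >= created_at else []


def pods_to_restart(query_time):
    """One-shot query: only pods visible right now are ever restarted."""
    return labeled_pods_visible(query_time)
```

In this model a query issued at RECOVERY_STARTED_AT finds nothing to restart, which matches the observed behavior.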

When the handling of labeled pods is moved to after they reach a stable state, they are restarted correctly:

t=0s: Finished controller manifest
t=9s: Started worker manifest
t=66s: Start of k8s-pod-recovery
t=67s: Finished worker manifest
t=73s: Started creation of "restart-on-reboot" labeled pod(s)
t=190s: Labeled pod(s) is(are) restarted
t=408s: New labeled pod(s) verified
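
The idea of deferring the handling until the labeled pods are stable can be sketched in Python as follows. This is a sketch under stated assumptions, not the actual change to k8s-pod-recovery: `list_pods`, `is_stable` and `restart` are hypothetical callables supplied by the caller (for example, thin wrappers around whatever query and restart mechanism the service uses), and the timeout and interval values are arbitrary.

```python
import time


def wait_for_stable_pods(list_pods, is_stable, timeout=300.0, interval=5.0,
                         sleep=time.sleep, clock=time.monotonic):
    """Poll until at least one labeled pod exists and all of them are stable.

    list_pods and is_stable are hypothetical callables provided by the caller.
    Returns the list of pods, or an empty list if the timeout expires.
    """
    deadline = clock() + timeout
    while True:
        pods = list_pods()
        # An empty result is treated as "not ready yet", not as "nothing to do".
        if pods and all(is_stable(pod) for pod in pods):
            return pods
        if clock() >= deadline:
            return []
        sleep(interval)


def recover_labeled_pods(list_pods, is_stable, restart, **wait_kwargs):
    """Restart the labeled pods only after they have reached a stable state."""
    pods = wait_for_stable_pods(list_pods, is_stable, **wait_kwargs)
    for pod in pods:
        restart(pod)
    return pods
```

The difference from a one-shot query is that an empty result makes the loop keep polling until the timeout, so pods created after the service starts are still picked up.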