NodeAffinity recovery logic triggered too soon

Bug #1877452 reported by Frank Miller
This bug affects 1 person
Affects: StarlingX
Status: Invalid
Importance: Medium
Assigned to: Bob Church
Milestone: (none listed)

Bug Description

Brief Description
-----------------
Kubernetes has a known issue where pod recovery after a host reboot sometimes leaves pods stuck in MatchNodeSelector or NodeAffinity status [1]. The recovery logic added by the StarlingX commit for this issue is triggered so soon after a host reboot that it can interfere with normal pod recovery.

[1] https://github.com/kubernetes/kubernetes/pull/80976

Severity
--------
Major

Steps to Reproduce
------------------
Reboot a controller host.

Expected Behavior
-----------------
All pods should recover and reach the Running state. No pods should remain in the MatchNodeSelector or NodeAffinity state. In addition, pods should have a chance to fully transition to the Running state before any recovery action is taken.

Actual Behavior
---------------
The check for pods in the MatchNodeSelector or NodeAffinity state runs as soon as the k8s conductor comes up. As a result, pods are sometimes recovered before they have had a chance to attempt their normal initialization.
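
For illustration only, the timing concern can be sketched as a gate that holds off the check until a grace period has elapsed since conductor startup. This is not the StarlingX implementation: the function name, the parameter names, and the 60-second value are all assumptions made for this sketch.

```python
import time

# Hypothetical grace period in seconds; an illustrative value, not taken from the code.
RECOVERY_GRACE_SECONDS = 60.0


def recovery_check_allowed(conductor_start, now=None, grace=RECOVERY_GRACE_SECONDS):
    """Return True once `grace` seconds have passed since the conductor started.

    `conductor_start` and `now` are readings from the same monotonic clock.
    """
    if now is None:
        now = time.monotonic()
    return (now - conductor_start) >= grace
```

A caller would record `time.monotonic()` when the conductor comes up and skip the stuck-pod check while this function returns False.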

Reproducibility
---------------
Intermittent

System Configuration
--------------------
All configs

Branch/Pull Time/Commit
-----------------------
Seen on loads built on April 20 or later

Last Pass
---------
Before the upversion to k8s v1.18

Timestamp/Logs
--------------
n/a

Test Activity
-------------
Regression

Workaround
----------
n/a

Ghada Khalil (gkhalil)
tags: added: stx.4.0 stx.containers
Frank Miller (sensfan22)
Changed in starlingx:
assignee: nobody → Bob Church (rchurch)
Frank Miller (sensfan22) wrote :

Looking closer, the algorithm does wait 1 minute before starting, and current data suggests that the pods in the NodeAffinity state are in a failed state, so recovering them at any time makes sense.
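
A minimal sketch of that observation, assuming pod objects shaped like the items returned by `kubectl get pods -o json` (plain dicts with `metadata` and `status`). The helper name and the selection rule are assumptions for illustration, not the actual StarlingX code: only pods whose phase is Failed with one of the two reasons are selected, so pods that are still initializing are left alone.

```python
# Status reasons named in this report.
STUCK_REASONS = {"MatchNodeSelector", "NodeAffinity"}


def pods_needing_recovery(pods):
    """Return (namespace, name) for pods that failed with a node-affinity reason."""
    stuck = []
    for pod in pods:
        status = pod.get("status", {})
        # Pending or Running pods never match, so they are not touched.
        if status.get("phase") == "Failed" and status.get("reason") in STUCK_REASONS:
            meta = pod.get("metadata", {})
            stuck.append((meta.get("namespace"), meta.get("name")))
    return stuck
```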

Changed in starlingx:
status: New → Invalid
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium