NodeAffinity recovery logic triggered too soon
Affects: StarlingX
Status: Invalid
Importance: Medium
Assigned to: Bob Church
Milestone: --
Bug Description
Brief Description
-----------------
Kubernetes has a known issue where pod recovery after a host reboot sometimes leaves pods stuck in MatchNodeSelector or NodeAffinity status [1]. The StarlingX commit that recovers from this is triggered so early after a host reboot that it can interfere with normal pod recovery.
[1] https:/
Severity
--------
Major
Steps to Reproduce
------------------
Reboot a controller host.
Expected Behavior
------------------
All pods should recover and come up in a running state. No pods should remain in a MatchNodeSelector or NodeAffinity state. In addition, pods should have a chance to fully transition to a running state before any recovery action is taken.
Actual Behavior
----------------
The check for pods in the MatchNodeSelector or NodeAffinity state runs as soon as the k8s conductor comes up. This sometimes causes pods to be recovered before they have had a chance to attempt their normal initialization.
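To make the timing problem concrete, here is a minimal sketch of the kind of gated recovery check being discussed. This is not the actual StarlingX code; the function name, pod record shape, and the 1-minute grace period are assumptions for illustration. The idea is that pods stuck in MatchNodeSelector/NodeAffinity are only selected for recovery after a delay from conductor startup, so pods still performing normal initialization are left alone.

```python
import time

# Pod phase/reason values that indicate the known stuck condition.
STUCK_REASONS = {"MatchNodeSelector", "NodeAffinity"}

# Assumed grace period: do not attempt recovery until this long after
# the conductor comes up, giving pods a chance to initialize normally.
RECOVERY_DELAY_SECONDS = 60


def pods_to_recover(pods, conductor_start, now=None):
    """Return names of pods that should be recovered (deleted so the
    scheduler recreates them).

    pods            -- list of dicts like {"name": ..., "phase": ..., "reason": ...}
    conductor_start -- epoch time when the conductor started
    now             -- epoch time of the check (defaults to current time)
    """
    now = time.time() if now is None else now
    if now - conductor_start < RECOVERY_DELAY_SECONDS:
        # Too early: recovering now could clobber pods that are still
        # attempting their normal post-reboot initialization.
        return []
    return [
        p["name"]
        for p in pods
        if p.get("phase") == "Failed" and p.get("reason") in STUCK_REASONS
    ]
```

A quick usage example: a pod that is Failed with reason NodeAffinity is skipped 30 seconds after startup but selected once the grace period has elapsed.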
Reproducibility
---------------
Intermittent
System Configuration
--------------------
All configs
Branch/Pull Time/Commit
-----------------------
Seen on loads built on April 20 or later
Last Pass
---------
Before the upversion to k8s v1.18
Timestamp/Logs
--------------
n/a
Test Activity
-------------
Regression
Workaround
----------
n/a
tags: added: stx.4.0 stx.containers
Changed in starlingx:
assignee: nobody → Bob Church (rchurch)
Changed in starlingx:
importance: Undecided → Medium
Looking closer, the algorithm does wait one minute before starting, and current data suggests the pods in NodeAffinity state are already in a failed state, so recovering them at any time makes sense.