pods do not get restarted in an AIO-DX system

Bug #1900920 reported by Steven Webster on 2020-10-21
This bug affects 1 person
Affects: StarlingX
Importance: Medium
Assigned to: Cole Walker

Bug Description

Brief Description
-----------------

Pods that belong to a k8s deployment, daemonset, etc. can be labeled restart-on-reboot="true", which causes them to be restarted automatically after the worker manifest has completed on an AIO system. This label is primarily used for pods that use SR-IOV interfaces, because such a pod starts coming up after the controller manifest has completed but before the SR-IOV devices are bound to an appropriate driver.
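
As a minimal sketch of where the label goes: it has to sit on the pod template so that the pods themselves carry it. The helper name and the example DaemonSet (its name, app label and image) are hypothetical placeholders; only the label key and value come from this report.

```python
import copy

RESTART_LABEL = "restart-on-reboot"  # label key taken from this report


def label_pod_template(manifest: dict) -> dict:
    """Return a copy of a Deployment/DaemonSet manifest whose pod template
    carries restart-on-reboot="true". Hypothetical helper for illustration."""
    labeled = copy.deepcopy(manifest)
    template_meta = labeled["spec"]["template"].setdefault("metadata", {})
    template_meta.setdefault("labels", {})[RESTART_LABEL] = "true"
    return labeled


# Hypothetical DaemonSet; the name and image are placeholders.
daemonset = {
    "apiVersion": "apps/v1",
    "kind": "DaemonSet",
    "metadata": {"name": "sriov-example"},
    "spec": {
        "selector": {"matchLabels": {"app": "sriov-example"}},
        "template": {
            "metadata": {"labels": {"app": "sriov-example"}},
            "spec": {"containers": [{"name": "app", "image": "example/image"}]},
        },
    },
}

labeled = label_pod_template(daemonset)
```

The original manifest is left untouched; the returned copy can be serialized and applied with whatever tooling is already in use.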

In an AIO-DX system, however, the restart can fail to occur if no node selector has been set, because the query for labeled pods depends on a field selector specifying the host that the recovery script is running on.

The restart fails to occur if the script looks for labeled pods before the pod has been scheduled on the node the script is running on.
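
The race can be sketched as follows. This is an illustration under assumptions, not the actual k8s-pod-recovery code: the function names, the in-memory pod records and the host name "controller-0" are hypothetical. The kubectl flags (`--all-namespaces`, `--selector`, `--field-selector`) and the `spec.nodeName` field are standard, but whether the script issues exactly this command is an assumption.

```python
import socket
from typing import Optional


def labeled_pods_query(host: Optional[str] = None) -> list:
    """Build a kubectl command that lists labeled pods bound to one host."""
    host = host or socket.gethostname()
    return [
        "kubectl", "get", "pods", "--all-namespaces",
        "--selector=restart-on-reboot=true",
        f"--field-selector=spec.nodeName={host}",
    ]


def pods_matching(pods: list, host: str) -> list:
    """Mimic the query: keep labeled pods whose nodeName equals the host."""
    return [
        p["name"] for p in pods
        if p["labels"].get("restart-on-reboot") == "true"
        and p.get("nodeName") == host
    ]


# A pod that has not been scheduled yet has no nodeName, so a query made
# at that moment skips it; the same query made slightly later finds it.
before = [{"name": "sriov-pod",
           "labels": {"restart-on-reboot": "true"},
           "nodeName": None}]
after = [{"name": "sriov-pod",
          "labels": {"restart-on-reboot": "true"},
          "nodeName": "controller-0"}]

missed = pods_matching(before, "controller-0")
found = pods_matching(after, "controller-0")
```

Here `missed` is empty while `found` contains the pod, which matches the intermittent behavior described above: the outcome depends on whether scheduling or the query happens first.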

Severity
--------
Minor: System/Feature is usable with minor issue

Steps to Reproduce
------------------
- Label a pod that is part of a daemonset with restart-on-reboot=true
- Ensure the pod cannot be scheduled on the other AIO-DX node (label, taint, etc.)
- Reboot the node the pod is scheduled on and watch the k8s-pod-recovery logs in /var/log/daemon.log
- Observe that no log entry reports the pod as recovered

Expected Behavior
-----------------
The pod should be recovered by the script

Actual Behavior
---------------
The pod may not be recovered by the script

Reproducibility
---------------
50/50

System Configuration
--------------------
AIO-DX

Branch/Pull Time/Commit
-----------------------
master 2020-10-20

Test Activity
-------------
Developer Testing

Workaround
----------
Add an init container with a delay of a few seconds to the pod in question
or
restart the pod manually
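
A minimal sketch of the init container workaround, assuming the workload manifest is available as a Python dict. The helper name, the container name "startup-delay", the "busybox" image and the five-second default are placeholders chosen for illustration; the report only asks for a delay of a few seconds.

```python
import copy


def add_startup_delay(manifest: dict, seconds: int = 5) -> dict:
    """Return a copy of the manifest with an init container that sleeps
    before the main containers start. Hypothetical helper."""
    delayed = copy.deepcopy(manifest)
    pod_spec = delayed["spec"]["template"]["spec"]
    pod_spec.setdefault("initContainers", []).append({
        "name": "startup-delay",   # placeholder name
        "image": "busybox",        # placeholder image that provides `sleep`
        "command": ["sh", "-c", f"sleep {seconds}"],
    })
    return delayed


# Hypothetical workload; only the pod template matters for this sketch.
workload = {
    "kind": "DaemonSet",
    "spec": {
        "template": {
            "metadata": {"labels": {"restart-on-reboot": "true"}},
            "spec": {"containers": [{"name": "app", "image": "example/image"}]},
        },
    },
}

delayed = add_startup_delay(workload, seconds=5)
```

The delay value may need tuning per system; the manual restart remains the fallback if the pod still comes up too early.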

tags: added: stx.networking
Ghada Khalil (gkhalil) on 2020-10-29
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.5.0
Changed in starlingx:
assignee: nobody → Cole Walker (cwalops)