This commit adds a mechanism to the pod recovery service to restart
pods based on the restart-on-reboot label.
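
The label-driven restart described above can be sketched roughly as
follows. This is a minimal illustration and not the code from this
commit: the label value "true", the per-node filtering, and the helper
names (list_labeled_pods, restart_labeled_pods, kubectl_runner) are
assumptions made for the example. It relies only on standard kubectl
selectors and on the fact that deleting a controller-managed pod causes
it to be recreated.

```python
import subprocess
from typing import Callable, List, Tuple

# Assumption: a value of "true" opts a pod in. The commit message names
# only the label key (restart-on-reboot), not its value.
RESTART_LABEL = "restart-on-reboot=true"

Runner = Callable[[List[str]], str]


def list_labeled_pods(run: Runner, node: str) -> List[Tuple[str, str]]:
    """Return (namespace, name) for labeled pods scheduled on `node`."""
    out = run([
        "kubectl", "get", "pods", "--all-namespaces", "--no-headers",
        "--selector", RESTART_LABEL,
        "--field-selector", f"spec.nodeName={node}",
    ])
    pods = []
    for line in out.splitlines():
        fields = line.split()
        # With --all-namespaces the first two columns are NAMESPACE, NAME.
        if len(fields) >= 2:
            pods.append((fields[0], fields[1]))
    return pods


def restart_labeled_pods(run: Runner, node: str) -> List[Tuple[str, str]]:
    """Delete each labeled pod; a managing controller (Deployment,
    DaemonSet, ...) is expected to recreate it. Bare pods are not
    recreated."""
    restarted = []
    for namespace, name in list_labeled_pods(run, node):
        run(["kubectl", "delete", "pod", "--namespace", namespace, name])
        restarted.append((namespace, name))
    return restarted


def kubectl_runner(cmd: List[str]) -> str:
    """Run a command for real and return its stdout."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout
```

On a live system this would be called as
restart_labeled_pods(kubectl_runner, "controller-0"), where the node
name is a placeholder. Passing the runner in as a parameter keeps the
selection logic testable without a cluster.
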
This is a mitigation for an issue seen on an AIO system using SR-IOV
interfaces on an N3000 FPGA device. Because the Kubernetes services
start coming up after the controller manifest has completed, they can
race with the configuration of devices and of the SR-IOV device plugin
in the worker manifest. The symptom is that the SR-IOV device
disappears from the running pod when the FPGA device is reset.
Notes:
- The pod recovery service only runs on controller nodes.
- The race between the Kubernetes bring-up and worker configuration
  should be fixed in the future by reorganizing the manifests to have
  either a separate AIO manifest or a separate Kubernetes manifest.
  This would require extensive feature work. In the meantime, this
  mitigation allows pods that experience this issue to recover.
Change-Id: If84b66b3a632752bd08293105bb780ea8c7cf400
Closes-Bug: #1896631
Signed-off-by: Steven Webster <email address hidden>
Reviewed: https://review.opendev.org/753410
Committed: https://git.openstack.org/cgit/starlingx/integ/commit/?id=77562993032f89da5295ee9e4141f1665fbb5e9f
Submitter: Zuul
Branch: master
commit 77562993032f89da5295ee9e4141f1665fbb5e9f
Author: Steven Webster <email address hidden>
Date: Tue Sep 22 12:25:32 2020 -0400
Enable pod restart based on a label