Comment 4 for bug 1896631

OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (master)

Reviewed: https://review.opendev.org/753410
Committed: https://git.openstack.org/cgit/starlingx/integ/commit/?id=77562993032f89da5295ee9e4141f1665fbb5e9f
Submitter: Zuul
Branch: master

commit 77562993032f89da5295ee9e4141f1665fbb5e9f
Author: Steven Webster <email address hidden>
Date: Tue Sep 22 12:25:32 2020 -0400

    Enable pod restart based on a label

    This commit adds a mechanism to the pod recovery service to restart
    pods based on the restart-on-reboot label.

    This is a mitigation for an issue seen on an AIO system using SR-IOV
    interfaces on an N3000 FPGA device. Since the Kubernetes services
    start coming up after the controller manifest has completed, they can
    race with the configuration of devices and the SR-IOV device plugin
    in the worker manifest. The symptom is the SR-IOV device disappearing
    from the running pod when the FPGA device is reset.

    Notes:

    - The pod recovery service only runs on controller nodes.
    - The race between the Kubernetes bring-up and worker configuration
      should be fixed in the future by reorganizing the manifests to
      have either a separate AIO manifest or a separate Kubernetes
      manifest. This would require extensive feature work. In the
      meantime, this mitigation allows pods that experience this issue
      to recover.

    Change-Id: If84b66b3a632752bd08293105bb780ea8c7cf400
    Closes-Bug: #1896631
    Signed-off-by: Steven Webster <email address hidden>
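
For illustration only, the label-based restart described in the commit message could be sketched roughly as below. This is not the code from the commit. The label value `true`, the node name `controller-0`, and all function names are assumptions made for the example; only the `restart-on-reboot` label key comes from the commit message. The sketch builds standard `kubectl` invocations that list the labeled pods on one node and then delete each one.

```python
import subprocess
from typing import List, Tuple

# Assumption: the commit names only the "restart-on-reboot" label key;
# the value "true" is a guess made for this sketch.
RESTART_SELECTOR = "restart-on-reboot=true"


def list_command(node: str, selector: str = RESTART_SELECTOR) -> List[str]:
    """Build a kubectl command listing labeled pods scheduled on `node`."""
    return [
        "kubectl", "get", "pods", "--all-namespaces",
        "--selector", selector,
        "--field-selector", f"spec.nodeName={node}",
        "--no-headers",
        "-o", "custom-columns=NS:.metadata.namespace,NAME:.metadata.name",
    ]


def parse_pods(output: str) -> List[Tuple[str, str]]:
    """Parse 'namespace name' rows; skip blank or malformed lines."""
    pods = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) == 2:
            pods.append((fields[0], fields[1]))
    return pods


def delete_command(namespace: str, name: str) -> List[str]:
    """Build a kubectl command deleting one pod without waiting for it."""
    return ["kubectl", "delete", "pod", name,
            "--namespace", namespace, "--wait=false"]


def restart_labeled_pods(node: str) -> None:
    """List the labeled pods on `node` and delete each of them."""
    result = subprocess.run(list_command(node), capture_output=True,
                            text=True, check=True)
    for namespace, name in parse_pods(result.stdout):
        subprocess.run(delete_command(namespace, name), check=True)


if __name__ == "__main__":
    # Hypothetical node name; requires kubectl and cluster access.
    restart_labeled_pods("controller-0")
```

Deleting a pod only results in a restart when a controller such as a Deployment or DaemonSet recreates it. A bare pod deleted this way would simply be removed.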