Subcloud install fails using the Redfish Virtual Media service with RVMC pod in pending state

Bug #1968183 reported by Enzo Candotti
This bug affects 1 person
Affects    Status        Importance  Assigned to    Milestone
StarlingX  Fix Released  Medium      Enzo Candotti

Bug Description

Subcloud (SC) installation fails with the following error:

TASK [Fail if rvmc-subcloud2001-4wzfd is not ready] ****************************
Monday 04 April 2022 17:19:40 +0000 (0:00:00.026) 0:01:02.579 **********
fatal: [subcloud2001 -> localhost]: FAILED! => changed=false
msg: Redfish Virtual Media Controller failed to start the install

Events from the pending RVMC pod:

  Type     Reason            Age                 From               Message
  ----     ------            ----                ----               -------
  Warning  FailedScheduling  43s (x13 over 14m)  default-scheduler  0/8 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 6 node(s) didn't match Pod's node affinity/selector.

Steps to Reproduce

After completing a Standard (STD) DC system controller install, attempt to install subclouds SC1 and SC2 using the remote install option. Both installs fail.

Expected Behavior

SC can be installed using the remote install option.

Actual Behavior

Installation fails with error:
TASK [Get rvmc_namespace] ******************************************************
Monday 04 April 2022 17:08:59 +0000 (0:00:00.015) 0:00:00.355 **********
changed: [subcloud1 -> localhost]

TASK [Ensure rvmc_namespace is created] ****************************************
Monday 04 April 2022 17:08:59 +0000 (0:00:00.157) 0:00:00.513 **********
changed: [subcloud1 -> localhost]

TASK [Get default registry key] ************************************************
Monday 04 April 2022 17:09:00 +0000 (0:00:00.157) 0:00:00.670 **********
changed: [subcloud1 -> localhost]

TASK [Copy default-registry-key to rvmc namespace] *****************************
Monday 04 April 2022 17:09:00 +0000 (0:00:00.174) 0:00:00.844 **********
changed: [subcloud1 -> localhost]

TASK [Create Redfish Virtual Media Controller resource file] *******************
Monday 04 April 2022 17:09:00 +0000 (0:00:00.279) 0:00:01.124 **********
changed: [subcloud1 -> localhost]

TASK [Activate Redfish Virtual Media Controller] *******************************
Monday 04 April 2022 17:09:00 +0000 (0:00:00.347) 0:00:01.471 **********
changed: [subcloud1 -> localhost]

TASK [Get the pod name that created by Redfish Virtual Media Controller batch job] ***
Monday 04 April 2022 17:09:01 +0000 (0:00:00.276) 0:00:01.748 **********
changed: [subcloud1 -> localhost]

TASK [set_fact] ****************************************************************
Monday 04 April 2022 17:09:01 +0000 (0:00:00.164) 0:00:01.913 **********
ok: [subcloud1 -> localhost]

TASK [Wait for 60 seconds for rvmc-subcloud1-ldwwh to be ready] ****************
Monday 04 April 2022 17:09:01 +0000 (0:00:00.047) 0:00:01.960 **********
changed: [subcloud1 -> localhost]

TASK [Save Redfish Virtual Media Controller logs if rvmc-subcloud1-ldwwh is not ready] ***
Monday 04 April 2022 17:10:01 +0000 (0:01:00.186) 0:01:02.147 **********
changed: [subcloud1 -> localhost]

TASK [debug] *******************************************************************
Monday 04 April 2022 17:10:01 +0000 (0:00:00.183) 0:01:02.331 **********
ok: [subcloud1 -> localhost] =>
msg: ''

TASK [Fail if rvmc-subcloud1-ldwwh is not ready] *******************************
Monday 04 April 2022 17:10:01 +0000 (0:00:00.026) 0:01:02.357 **********
fatal: [subcloud1 -> localhost]: FAILED! => changed=false
msg: Redfish Virtual Media Controller failed to start the install

Reproducibility

100% on DC systems with Standard system controllers.

System Configuration

DC with Standard system controllers; remote install tested on 2 subclouds.
SW_VERSION="22.02"

Branch and the time when code was pulled or git commit or cengn load info

Last Pass
New test scenario.

Timestamp/Logs

TASK [Fail if rvmc-subcloud1-ldwwh is not ready] *******************************
Monday 04 April 2022 17:10:01 +0000 (0:00:00.026) 0:01:02.357 **********
fatal: [subcloud1 -> localhost]: FAILED! => changed=false
msg: Redfish Virtual Media Controller failed to start the install

Alarms

-

Test Activity

Developer Testing

Workaround

Remove the NoSchedule taint from the master node:
kubectl taint nodes controller-0 node-role.kubernetes.io/master:NoSchedule-

After removing the taint, re-run the subcloud installation.
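
Before applying the workaround it may help to confirm which nodes actually carry the taint. The following is a minimal sketch in Python, assuming the node list has already been obtained with `kubectl get nodes -o json` and parsed into a dict. The function names `nodes_with_taint` and `untaint_command` are hypothetical helpers, not part of StarlingX or kubectl; the second only builds the command string shown above and does not run it.

```python
from typing import List

# Taint key reported in the FailedScheduling event of this bug.
MASTER_TAINT_KEY = "node-role.kubernetes.io/master"


def nodes_with_taint(node_list: dict,
                     key: str = MASTER_TAINT_KEY,
                     effect: str = "NoSchedule") -> List[str]:
    """Return names of nodes whose spec.taints contains the given key and effect.

    node_list is the parsed output of `kubectl get nodes -o json`.
    """
    names = []
    for node in node_list.get("items", []):
        # spec.taints is absent on untainted nodes.
        taints = node.get("spec", {}).get("taints") or []
        if any(t.get("key") == key and t.get("effect") == effect for t in taints):
            names.append(node["metadata"]["name"])
    return names


def untaint_command(node: str,
                    key: str = MASTER_TAINT_KEY,
                    effect: str = "NoSchedule") -> str:
    """Build (but do not execute) the kubectl command that removes the taint."""
    # The trailing "-" tells kubectl to remove the taint instead of adding it.
    return f"kubectl taint nodes {node} {key}:{effect}-"
```

For example, parsing the JSON with `json.loads` and passing the result to `nodes_with_taint` yields the node names to feed into `untaint_command`. Removing the taint allows any user pod onto the master nodes, so this remains a workaround rather than a fix.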

Changed in starlingx:
assignee: nobody → Enzo Candotti (ecandotti)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)
Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/836987
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/5b20dbef0a1a4941af04e14728ebc6f8921a816e
Submitter: "Zuul (22348)"
Branch: master

commit 5b20dbef0a1a4941af04e14728ebc6f8921a816e
Author: Enzo Candotti <email address hidden>
Date: Thu Apr 7 11:46:53 2022 -0300

    Add NoSchedule tolerations to rvmc pod

    Add a toleration for the node-role.kubernetes.io/master:NoSchedule
    taint. This taint is restored on this review
    https://review.opendev.org/c/starlingx/ansible-playbooks/+/811978 to all
    standard (non-AIO) master nodes to prevent user pods from being
    scheduled and run. Therefore, these workloads will be scheduled and
    run on a worker node.

    This change will ensure that the rvmc pod will continue run on the
    master nodes (as designed).

    Test plan:
    PASS: Install rvmc pod with taint enabled on a STD system
    and verify that the pod is running.
    PASS: Describe rvmc pod and verify that toleration is added:
    Tolerations: node-role.kubernetes.io/master:NoSchedule op=Exists

    Closes-bug: 1968183

    Signed-off-by: Enzo Candotti <email address hidden>
    Change-Id: I12ae4182967848d6acf9d7c0c99983b5be57f539
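
The scheduling behaviour behind this fix can be illustrated with a small model. This is a simplified sketch in Python of how a toleration is matched against a taint, assuming the standard Kubernetes rules (an empty toleration effect matches any effect, `Exists` ignores the value, `Equal` compares key and value). The `Taint` and `Toleration` classes and the `blocking_taints` function are hypothetical names for illustration and are not taken from ansible-playbooks or the Kubernetes source.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: str = ""
    operator: str = "Equal"  # "Equal" or "Exists"
    value: str = ""
    effect: str = ""         # empty matches every effect


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Return True if the toleration matches the taint."""
    if toleration.effect and toleration.effect != taint.effect:
        return False
    if toleration.operator == "Exists":
        # An empty key with Exists tolerates every taint key.
        return toleration.key == "" or toleration.key == taint.key
    return toleration.key == taint.key and toleration.value == taint.value


def blocking_taints(taints: List[Taint], tolerations: List[Toleration]) -> List[Taint]:
    """Return the hard taints (NoSchedule, NoExecute) that no toleration matches.

    A non-empty result means the scheduler would filter the node out.
    """
    return [
        t for t in taints
        if t.effect in ("NoSchedule", "NoExecute")
        and not any(tolerates(tol, t) for tol in tolerations)
    ]


master_taint = Taint(key="node-role.kubernetes.io/master", effect="NoSchedule")

# Before the fix: the rvmc pod has no matching toleration, so the taint blocks it.
before = blocking_taints([master_taint], [])

# After the fix: the toleration reported in the test plan (op=Exists) matches.
fix = Toleration(key="node-role.kubernetes.io/master", operator="Exists", effect="NoSchedule")
after = blocking_taints([master_taint], [fix])
```

In this model `before` contains the master taint and `after` is empty, which corresponds to the pod moving from Pending to schedulable on the master nodes once the toleration is present.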

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.7.0 stx.config stx.distcloud