Reinstall of worker node results in Configuration failure. "Could not find command 'configure'"

Bug #1823396 reported by Wendy Mitchell
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Al Bailey

Bug Description

Brief Description
-----------------
Host Reinstall (worker node) failed to apply puppet manifest
Configuration failure, threshold reached

Severity
--------
standard

Steps to Reproduce
------------------
1. worker node was running prior to this test
2. reinstall initiated (CLI sequence shown below)
3. attempt to unlock after the worker node is installed and online
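
For reference, the equivalent CLI sequence (taken from the comments below; host name as in this report) is:
  system host-lock compute-2
  system host-reinstall compute-2
  system host-unlock compute-2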

Expected Behavior
------------------
Host reinstalls and runs the puppet manifest without failures.

Actual Behavior
----------------
Host reinstalled but failed to unlock successfully.

compute-2 Worker Unlocked Disabled Failed 2 minutes
Configuration failure, threshold reached, Lock/Unlock to retry

[ 101.778238] worker_config[21339]: [WARNING]
[ 101.783196] worker_config[21339]: Warnings found. See /var/log/puppet/2019-04-05-18-44-30_worker/puppet.log for details
[ 101.796110] worker_config[21339]: ****************************************************
[ 101.805131] worker_config[21339]: ****************************************************
[ 101.814059] worker_config[21339]: Failed to run the puppet manifest (RC:1)
[ 101.823052] worker_config[21339]: ****************************************************
[ 101.832051] worker_config[21339]: ****************************************************

Reproducibility
---------------
yes

System Configuration
--------------------
2+3
(Lab: wp_3-7)

Branch/Pull Time/Commit
--------------------
BUILD_ID="20190405T013000Z"

Last Pass
---------

Timestamp/Logs
--------------
see attached puppet.log output

2019-04-05T18:45:06.338 Error: 2019-04-05 18:45:06 +0000 /Stage[main]/Platform::Kubernetes::Worker::Init/Exec[configure worker node]/returns: change from notrun to 0 failed: Could not find command 'configure'

Test Activity
-------------
[Platform pinning Feature Testing]

Revision history for this message
Bart Wensley (bartwensley) wrote :

This problem is almost certainly happening because the re-install causes the worker node to attempt to join the cluster again (with kubeadm join). This fails because the node still exists in kubernetes. The solution is likely to have sysinv delete the node from kubernetes when the re-install is done; we are already doing this when the node is deleted.
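
For illustration only (this is not the actual sysinv change, just a manual equivalent; node name taken from this report), the stale node object could be removed from kubernetes on the active controller before the rejoin:
  # assumes an admin kubeconfig, e.g. export KUBECONFIG=/etc/kubernetes/admin.conf
  kubectl delete node compute-2
  # the reinstalled worker can then run 'kubeadm join' cleanly on unlock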

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as release gating; issue prevents node re-installs

tags: added: stx.2.0 stx.containers
Changed in starlingx:
importance: Undecided → High
status: New → Triaged
assignee: nobody → Bart Wensley (bartwensley)
assignee: Bart Wensley (bartwensley) → Kevin Smith (kevin.smith.wrs)
Ghada Khalil (gkhalil)
tags: added: stx.retestneeded
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-nfv (master)

Fix proposed to branch: master
Review: https://review.openstack.org/651754

Changed in starlingx:
status: Triaged → In Progress
Frank Miller (sensfan22)
Changed in starlingx:
assignee: Kevin Smith (kevin.smith.wrs) → Al Bailey (albailey1974)
Revision history for this message
Al Bailey (albailey1974) wrote :

Review 651754 on April 11 has nothing to do with this bug; it was for bug 1824027.

For this bug:
It looks as though "$join_cmd" is cleared in hiera, so the puppet code raises an error.

The steps to "reinstall" are:
  system host-lock compute-0
  system host-reinstall compute-0

I am proceeding with those steps to verify the bug.
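
One way to confirm the missing value on the active controller is to grep the generated hieradata for the join command (the path and key name below are assumptions and may vary by release):
  grep -r join_cmd /opt/platform/puppet/*/hieradata/
  # expect a full 'kubeadm join ...' command; an empty value reproduces this failure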

Revision history for this message
Al Bailey (albailey1974) wrote :

Tested the fix by doing the following:
system host-lock compute-0
system host-reinstall compute-0
system host-unlock compute-0

Note: it is possible (but unsupported) to reinstall an unlocked host through a network boot (F12). This will almost always still fail because the token expires after 24 hours, so the attempt to join kubernetes fails. The fix in that situation is to lock/unlock the compute.
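
For reference (standard kubernetes behaviour, not part of this fix): kubeadm bootstrap tokens default to a 24 hour TTL, and a fresh join command can be printed on the control plane node with
  kubeadm token create --print-join-command
The supported recovery on StarlingX remains the lock/unlock described above.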

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/656701

Revision history for this message
Al Bailey (albailey1974) wrote :

Verified the behaviour when an (unsupported) network reinstall is performed on a worker that has been running for 24 hours (i.e. its join token is no longer valid), rather than using the (supported) sysinv reinstall.

The compute will reinstall and boot "unlocked / disabled / failed" due to a puppet error related to the expired token.

2019-05-02T14:52:07.047 Notice: 2019-05-02 14:52:06 +0000 /Stage[main]/Platform::Kubernetes::Worker::Init/Exec[configure worker node]/returns: unable to fetch the kubeadm-config ConfigMap: failed to get config map: Unauthorized
2019-05-02T14:52:07.068 Error: 2019-05-02 14:52:06 +0000 kubeadm join 192.168.206.2:6443 --token n2kcid.cu64096tbdj0p7dh --discovery-token-ca-cert-hash sha256:db1bd55af2166b72f1ba0442d5405f45cdc89f3ac6dccf47f8e25bc8501ffeaf
2019-05-02T14:52:07.071 returned 1 instead of one of [0]

Verified this worker can be recovered by issuing lock/unlock.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/656701
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=2a9b2df7c08255b5b21e10a0ff9118e6a7427d06
Submitter: Zuul
Branch: master

commit 2a9b2df7c08255b5b21e10a0ff9118e6a7427d06
Author: Al Bailey <email address hidden>
Date: Wed May 1 14:40:56 2019 -0500

    Fix configuration failure when reinstalling worker node

    When doing a system host-reinstall, the unlock of the worker would
    fail with an error in the puppet logs.

    This is because reinstall results in there not being any kubernetes
    configuration, and therefore a kubernetes join is expected, however
    the hieradata with that information was only being provided on an
    initial creation of the node.

    The fix was to always provide the token and command for the
    kubernetes join command.

    This fixes a host reinstall through sysinv.
    (lock / reinstall / unlock)

    An unsupported reinstall through a network boot (F12) will
    typically fail since the token expires after 24 hours.
    However that scenario can be remedied by a lock/unlock.

    Change-Id: Idea9f1e4b8f98203260ec0b6af7ae29c34579b86
    Fixes-Bug: 1823396
    Signed-off-by: Al Bailey <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

Verified reinstalls after locking the worker host; unlock confirms the transition to the in-test state and then the enabling status.
Verified on a worker node on a 2+3 system.
yow-cgcs-wildcat-71_75 2019-05-27

controller-0 VIM_Thread[1584008] INFO _host.py.820 Host compute-2 FSM State-Change: prev_state=enabling, state=enabled, event=task-completed.

tags: removed: stx.retestneeded