worker nodes applying puppet manifest before unlock

Bug #1853329 reported by Joseph Richard
Affects    Status        Importance  Assigned to     Milestone
StarlingX  Fix Released  Medium      Joseph Richard

Bug Description

Brief Description
-----------------
Puppet manifests are being applied on a worker node before it is unlocked. The hieradata appears to have been generated by the sriov interface configuration before the mgmt interface was configured.

Severity
--------
<Major: System/Feature is usable but degraded>

Steps to Reproduce
------------------
1. Bring up the system.
2. Configure sriov interfaces on the worker nodes before configuring the management interface.

Expected Behavior
------------------
Puppet manifests on worker nodes are not applied until after unlock.

Actual Behavior
----------------
Puppet manifests on worker nodes are applied before unlock.
This results in the unlock failing and the system being unusable.

Reproducibility
---------------
This is consistently reproducible on one lab

System Configuration
--------------------
Multi-node with sriov on worker nodes

Branch/Pull Time/Commit
-----------------------
###
### StarlingX
### Built from master
###

OS="centos"
SW_VERSION="19.09"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="20191119T000000Z"

JOB="STX_build_master_master"
<email address hidden>"
BUILD_NUMBER="324"
BUILD_HOST="starlingx_mirror"
BUILD_DATE="2019-11-19 00:00:00 +0000"

Last Pass
---------
Not previously tested with the same config in that lab.

Joseph Richard (josephrichard) wrote :

Added logs for the controllers. Compute node logs were not collected, as the nodes failed to come up and became unreachable due to this bug. The most relevant information is the hieradata for the worker nodes.

Ghada Khalil (gkhalil) wrote :

stx.3.0 / high priority - worker nodes don't recover if sriov i/fs are configured first

description: updated
Changed in starlingx:
assignee: nobody → Joseph Richard (josephrichard)
importance: Undecided → High
status: New → Triaged
tags: added: stx.3.0 stx.config stx.networking
Matt Peters (mpeters-wrs) wrote :

The issue is being caused by the changes made under the following commit.
https://opendev.org/starlingx/config/commit/f8fc051a9bc49474251cb475bb36654174edf643

Ghada Khalil (gkhalil) wrote :

This issue can be avoided by configuring the interfaces in order. Lowering the priority and moving out of stx.3.0
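
As a rough illustration of this ordering workaround, the sketch below reorders a batch of planned interface configuration requests so that the mgmt interface is handled first. The request dictionaries and the `order_interface_requests` helper are hypothetical and are not part of any StarlingX tool.

```python
def order_interface_requests(requests):
    """Return the requests with any mgmt interface first.

    Each request is a hypothetical dict with "name" and "network" keys.
    sorted() is stable, so the relative order of the other requests is kept.
    """
    return sorted(requests, key=lambda r: 0 if r.get("network") == "mgmt" else 1)


planned = [
    {"name": "sriov0", "network": None},   # sriov interface, no platform network
    {"name": "mgmt0", "network": "mgmt"},  # management interface
    {"name": "data0", "network": None},
]
ordered_names = [r["name"] for r in order_interface_requests(planned)]
```

With the sample input, `mgmt0` moves to the front and the remaining requests keep their original order.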

tags: added: stx.4.0
removed: stx.3.0
Changed in starlingx:
importance: High → Medium
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/698816

Changed in starlingx:
status: Triaged → In Progress
Ghada Khalil (gkhalil)
description: updated
OpenStack Infra (hudson-openstack) wrote : Change abandoned on config (master)

Change abandoned by Joseph Richard (<email address hidden>) on branch: master
Review: https://review.opendev.org/698816

OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/699734

OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/699734
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=a9686eccf93dbbb89f8e460aa0fd449e688f957d
Submitter: Zuul
Branch: master

commit a9686eccf93dbbb89f8e460aa0fd449e688f957d
Author: Joseph Richard <email address hidden>
Date: Fri Dec 13 15:01:58 2019 -0500

    Prevent provisioning sriov without mgmt iface

    This commit adds a semantic check to prevent provisioning an sriov
    interface without a mgmt interface configured. This is necessary to
    prevent an invalid network config being generated and applied, which
    causes the node to lose connectivity over mgmt and become unreachable.

    Closes-bug: 1853329
    Change-Id: I783447a2214a1b0f4d698ac20037f8f1e8083958
    Signed-off-by: Joseph Richard <email address hidden>
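
For illustration, the semantic check described in this commit message could look roughly like the sketch below. It is not the merged sysinv code: the `Interface` dataclass, `SemanticCheckError`, and both function names are hypothetical, and the class and network strings (`pci-sriov`, `platform`, `mgmt`) are assumed values.

```python
from dataclasses import dataclass, field
from typing import List


class SemanticCheckError(Exception):
    """Raised when a requested interface configuration is rejected."""


@dataclass
class Interface:
    # Hypothetical, simplified stand-in for a host interface record.
    name: str
    ifclass: str                                       # assumed: "platform", "pci-sriov", ...
    networks: List[str] = field(default_factory=list)  # assumed: e.g. ["mgmt"]


def has_mgmt_interface(interfaces):
    """True if any existing interface is assigned to the mgmt network."""
    return any(
        i.ifclass == "platform" and "mgmt" in i.networks for i in interfaces
    )


def check_sriov_provisioning(existing, new_iface):
    """Reject an sriov interface when the host has no mgmt interface yet."""
    if new_iface.ifclass == "pci-sriov" and not has_mgmt_interface(existing):
        raise SemanticCheckError(
            "Cannot provision sriov interface %s: no mgmt interface is "
            "configured on this host" % new_iface.name
        )
```

Rejecting the request up front means no network configuration is generated for the host until a mgmt interface exists, which is the failure mode this bug describes.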

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/705837

OpenStack Infra (hudson-openstack) wrote : Fix merged to config (f/centos8)

Reviewed: https://review.opendev.org/705837
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=8ac6ec70cb8a787a274fd7227eb34d2b7bcd5f5b
Submitter: Zuul
Branch: f/centos8

commit 7995dd436954b92f1c4e3f760a7609af670c84c8
Author: Jessica Castelino <email address hidden>
Date: Mon Feb 3 12:07:26 2020 -0500

    Unit test cases for helm charts

    Test cases added for API endpoints used by:
     1. helm-override-delete
     2. helm-override-show
     3. helm-override-list
     4. helm-override-update
     5. helm-chart-attribute-modify

    Story: 2007082
    Task: 38012
    Change-Id: I86763496bb41084c006f2486702c3b15bde039d2
    Signed-off-by: Jessica Castelino <email address hidden>

commit 7e2fda010299f7305b630d6db97bbe1e169a38b1
Author: Angie Wang <email address hidden>
Date: Wed Jan 29 21:18:18 2020 -0500

    Finish kubernetes networking upgrade support

    The commit completes the RPC kube_upgrade_networking
    in sysinv-conductor to run ansible playbook
    upgrade-k8s-networking.yml to upgrade networking pods
    and also updates the networking upgrade function called
    as part of sysinv-conductor startup to provide a current
    kubernetes version when running the upgrade playbook.

    The second control plane upgrade can only be performed
    after the networking upgrade is done, fix the semantic
    check in sysinv api.

    Change-Id: I8dcf5a2baedfaefb0a7ca037eb47bf7cacd686f8
    Story: 2006781
    Task: 37584
    Depends-On: https://review.opendev.org/#/c/705310/
    Signed-off-by: Angie Wang <email address hidden>

commit 52c37a35d2cd62fa1cc1933765c76c1ba8616864
Author: Jerry Sun <email address hidden>
Date: Fri Jan 31 16:10:25 2020 -0500

    Add Unit Tests for Dex Sysinv Changes

    Add unit tests for the dex helm chart changes under the same story
    and task

    Story: 2006711
    Task: 37857

    Depends-On: https://review.opendev.org/#/c/705297/

    Change-Id: I3a0e1c490e56188adfbd614fd6ebb21bfdddf49e
    Signed-off-by: Jerry Sun <email address hidden>

commit 144587a6ac9fc48b9249be99abadd35dfa49e7a7
Author: Teresa Ho <email address hidden>
Date: Fri Jan 31 15:35:04 2020 -0500

    Tox tests for OIDC client helm overrides

    Added some tox tests for OIDC client helm overrides.

    Story: 2006711
    Task: 38481

    Change-Id: If4aeaf0010c7076d1d83bacd00d6fd0122d4ffad
    Signed-off-by: Teresa Ho <email address hidden>

commit 763ddeadd4e83af6cebf752d693ee3e7d3b005b1
Author: Thomas Gao <email address hidden>
Date: Wed Jan 29 16:30:40 2020 -0500

    Fixed errors in address deletion

    Allowed address deletion despite missing associated interface or host.

    Enabled relevant unit test.

    Closes-Bug: 1860186

    Change-Id: Ie6e6358aa75091e92914a8b581b4d5203a596f56
    Signed-off-by: Thomas Gao <email address hidden>

commit 61463608169e75601b8a4f9db7c98190788d6f6a
Author: Thomas Gao <email address hidden>
Date: Tue Jan 28 15:32:58 2020 -0500

    Fixed broken sysinv address get-all api call

    Removed unexpected keyword argument that caused the error....

tags: added: in-f-centos8