Subcloud upgrade-activate failed due to concurrent user config updates

Bug #2034446 reported by Manoel Benedito Neto
Affects    Status        Importance  Assigned to           Milestone
StarlingX  Fix Released  Medium      Manoel Benedito Neto

Bug Description

Brief Description
-----------------
Upgrade activation failed on a subcloud whose kernel log showed ice driver errors.

Severity
--------
Major

Steps to Reproduce
------------------
Apply DC upgrade orchestration to the subclouds.

Expected Behavior
-----------------
Successfully upgrade-activate the AIO-SX subcloud.

Actual Behavior
---------------
ERROR dcmanager.orchestrator.orch_thread [req-e8[hash] - - - - -] (upgrade) Failed! Stage: 1, State: activating upgrade, Subcloud: cn[hash]: Exception: Timeout waiting for activation to complete. Please check sysinv.log on the subcloud for details.

Reproducibility
---------------
Reproducible

System Configuration
--------------------
DC: Standard system controller duplex + 1 worker

Subcloud: AIO-SX

Branch/Pull Time/Commit
-----------------------
N/A

Load info (eg: 2022-03-10_20-00-07)
-----------------------------------
N/A

Last Pass
---------
N/A

Timestamp/Logs
--------------
Summary of key events (system controller orchestrator.log):

- Starting upgrade

INFO dcmanager.orchestrator.states.base [req-ef[hash] - - - - -] Stage: 1, State: starting upgrade, Subcloud: cn[hash], Details: Upgrade started. State=started

- Remote install of subcloud

INFO dccommon.subcloud_install [req-6f[hash] - - - - -] Start remote install cn[hash]
INFO dcmanager.orchestrator.states.base [req-6f[hash] - - - - -] Stage: 1, State: upgrading simplex, Subcloud: [hash], Details: Successfully installed subcloud

- Migrating Data (seen on central orchestrator logs)

INFO dcmanager.orchestrator.orch_thread [-] (upgrade) Stage: 1, State: migrating data, Subcloud: cn[hash]
INFO dcmanager.orchestrator.states.base [req-b1[hash] - - - - -] Stage: 1, State: migrating data, Subcloud: cn[hash], Details: Start migrating data...

First ice driver-related error:

cn[hash] kernel: err [ 2067.921487] ice 0000:18:00.3: Failed to stop Tx ring 0 on VSI 48
cn[hash] kernel: info [ 2067.921504] ice 0000:18:00.3: VF 4 failed opcode 6, retval: -5
cn[hash] kernel: err [ 2067.921638] iavf 0000:18:19.4: PF returned error -5 (IAVF_ERR_PARAM) to our request

These errors continue for a long time.

- While the ice driver errors are still being logged in the subcloud kernel.log, the system controller shows the subcloud upgrade in the activation stage.

INFO dcmanager.orchestrator.orch_thread [-] (upgrade) Stage: 1, State: activating upgrade, Subcloud: [hash]
2023-08-11 04:36:45.960 304800 INFO dcmanager.orchestrator.states.base [req-e8[hash] - - - - -] Stage: 1, State: activating upgrade, Subcloud: cn[hash], Details: Activation in progress, waiting... State=activating

Alarms
------
N/A

Test Activity
-------------
Normal use

Workaround
----------
Re-apply the upgrade strategy from the system controller; this should resolve the upgrade-activate failure.

Changed in starlingx:
assignee: nobody → Manoel Benedito Neto (mbenedit)
Changed in starlingx:
status: New → In Progress
Ghada Khalil (gkhalil) wrote :
information type: Private Security → Public
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.9.0 stx.distcloud stx.security
Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil) wrote :

There is a bug in the initial fix, so I am re-opening this report to add a follow-up fix.

Changed in starlingx:
status: Fix Released → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/config/+/915030

Ghada Khalil (gkhalil)
tags: added: stx.10.0
removed: stx.9.0
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/915030
Committed: https://opendev.org/starlingx/config/commit/5d853423efdb4ba020f7a54f0b1295ef150a26f2
Submitter: "Zuul (22348)"
Branch: master

commit 5d853423efdb4ba020f7a54f0b1295ef150a26f2
Author: Rei Oliveira <email address hidden>
Date: Wed Apr 3 19:20:06 2024 -0300

    Wrap 'classes' parameter as a list in config_dict object

    This change fixes a type mismatch bug introduced in [1]. A python list
    is expected but a python str is provided instead.

    [1] https://review.opendev.org/c/starlingx/config/+/893566

    This type mismatch will result in the 'deadlock' prevention logic to
    never be invoked. In [2] below, the 'if classes' branch is never entered:

    [2] https://opendev.org/starlingx/config/src/commit/85a548ffcc77d708de848a4648352f8481543695/sysinv/sysinv/sysinv/sysinv/conductor/manager.py#L13481

    Test plan:

    PASS: Run 'sudo chage -M 999 sysadmin; sudo chage -M 888 sysadmin;
          sudo chage -M 777 sysadmin'. Notice 'out of config alarm' in
          'fm alarm-list'. Verify that it clears up after about 5 min.
    PASS: Verify in i_user db table and /etc/shadow that it correctly
          contains the last password age, 777 in this case.

    Note: In a managed subcloud, the value in /etc/shadow file will
    be changed again in about 20 min to sync with the sysadmin password
    and age in the system controller.

    Closes-Bug: 2034446

    Signed-off-by: Rei Oliveira <email address hidden>
    Change-Id: I24d9807e9eb2d94e026be7b8f3448a6cd42fcdd6
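
The str-versus-list mismatch described in the commit message above can be illustrated with a minimal sketch. This is not the sysinv code: the function name, the guarded-class set, and the class name are hypothetical, and the actual check in manager.py is not reproduced here. The sketch only shows one common way such a mismatch silently disables a branch, assuming the consumer iterates over 'classes' and compares each element to a full class name.

```python
# Hypothetical illustration only: these names are NOT taken from sysinv.
DEADLOCK_PRONE_CLASSES = {"platform::users::runtime"}  # assumed class name


def needs_deadlock_guard(config_dict):
    """Return True when any requested class is in the guarded set."""
    classes = config_dict.get("classes")
    if not classes:
        return False
    # Iterating a str yields single characters, so no element ever
    # matches a full class name and the guard is silently skipped.
    return any(cls in DEADLOCK_PRONE_CLASSES for cls in classes)


buggy = {"classes": "platform::users::runtime"}    # str: the mismatch
fixed = {"classes": ["platform::users::runtime"]}  # list: the wrapped form

print(needs_deadlock_guard(buggy))
print(needs_deadlock_guard(fixed))
```

Under these assumptions the str form never triggers the guard, while the same value wrapped in a list does, which matches the shape of the fix named in the commit title.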

Changed in starlingx:
status: In Progress → Fix Released