Upgrade failed due to manual route config during upgrade

Bug #1970205 reported by John Kung
This bug affects 1 person
Affects     Status         Importance   Assigned to   Milestone
StarlingX   Fix Released   Medium       John Kung

Bug Description

Description:
Upgrade to stx6.0 failed because route configuration was allowed during the upgrade.

Severity
Major: this blocks progress on a number of DC-related upgrade JIRAs. The workaround is to lock/unlock the host. However, the issue is caused by manual config steps during the upgrade.

Steps to Reproduce
Start the system controller upgrade job

Expected Behavior
System controller upgrade succeeds

Actual Behavior
Controller-0 upgrade fails
Searching for "boot menu" in the console output of the following job shows the boot device selection attempt:
http://128.224.150.21/jenkins/view/Upgrade/job/SYSTEM_UPGRADE/629/consoleFull

[2021-12-13 04:19:06,165] 95 INFO controller-0 menu.select :: Attempt to select boot device option IBA XE Slot 8301 v2140 index 2
[2021-12-13 04:19:06,166] 99 INFO controller-0 menu.select :: Current index = 0
[2021-12-13 04:19:06,166] 125 INFO controller-0 menu.move_down:: Press: Down
[2021-12-13 04:19:07,168] 99 INFO controller-0 menu.select :: Current index = 1
[2021-12-13 04:19:07,168] 125 INFO controller-0 menu.move_down:: Press: Down
[2021-12-13 04:19:08,170] 686 INFO controller-0 menu.enter :: Press Enter (
)to select IBA XE Slot 8301 v2140 option

In the job I initiated
http://128.224.150.21/jenkins/view/Upgrade/job/SYSTEM_UPGRADE/824/consoleFull
there is no such attempt.

Additional info:
It failed to swact to controller-1 due to the config-out-of-date alarm.
sysinv 2022-04-06 18:56:54.764 105950 INFO sysinv.agent.manager [-] config_apply_runtime_manifest: 4aa99f61-2ceb-4748-8b86-c79059c46e29 {u'classes': u'platform::network::routes::runtime', u'force': False, u'personalities': [u'controller', u'worker', u'storage'], u'host_uuids': [u'5600cd36-f2d1-456e-9304-480b8877de3c']} controller
sysinv 2022-04-06 18:56:54.765 105950 INFO sysinv.agent.manager [-] controller-active
sysinv 2022-04-06 18:56:54.765 105950 INFO sysinv.agent.manager [-] _apply_runtime_manifest with hieradata_path = '/opt/platform/puppet/21.05/hieradata'
sysinv 2022-04-06 18:56:55.652 114512 INFO sysinv.conductor.manager [-] _config_update_hosts personalities=['controller', 'worker', 'storage'] host_uuids=[u'48aca7e0-61e2-4e42-bdee-2eb3caba3164'] reboot=False config_uuid=3af91f4b-5ab5-452f-a8a0-402c882868e8 tb= File "/usr/lib64/python2.7/site-packages/sysinv/conductor/manager.py", line 6629, in update_route_config
host_uuids=[host.uuid])
sysinv 2022-04-06 18:56:55.660 114512 INFO sysinv.conductor.manager [-] Setting config target of host 'controller-1' to 'baf91f4b-5ab5-452f-a8a0-402c882868e8'.
sysinv 2022-04-06 18:56:55.669 114512 WARNING sysinv.conductor.manager [-] controller-1: iconfig out of date: target baf91f4b-5ab5-452f-a8a0-402c882868e8, applied 17da2050-446e-434c-bb33-b84d6b1e4abc
sysinv 2022-04-06 18:56:55.670 114512 WARNING sysinv.conductor.manager [-] SYS_I Raise system config alarm: host controller-1 config applied: 17da2050-446e-434c-bb33-b84d6b1e4abc vs. target: baf91f4b-5ab5-452f-a8a0-402c882868e8.
sysinv 2022-04-06 18:56:55.712 114512 INFO sysinv.conductor.manager [-] _config_update_hosts config_uuid=3af91f4b-5ab5-452f-a8a0-402c882868e8
sysinv 2022-04-06 18:56:55.723 114512 INFO sysinv.conductor.manager [-] Skip applying manifest for host: controller-1. Version 21.12 mismatch.
sysinv 2022-04-06 18:56:55.723 114512 INFO sysinv.conductor.manager [-] _remove_config_from_reboot_config_list host: 48aca7e0-61e2-4e42-bdee-2eb3caba3164,config_uuid: 3af91f4b-5ab5-452f-a8a0-402c882868e8
sysinv 2022-04-06 18:56:55.723 114512 INFO sysinv.conductor.manager [-] _remove_config_from_reboot_config_list fail host:48aca7e0-61e2-4e42-bdee-2eb3caba3164 config_uuid 3af91f4b-5ab5-452f-a8a0-402c882868e8
sysinv 2022-04-06 18:56:55.735 114512 INFO sysinv.conductor.manager [-] controller-1: 48aca7e0-61e2-4e42-bdee-2eb3caba3164 reboot required config_applied 3af91f4b-5ab5-452f-a8a0-402c882868e8 host_reboot_config ['c72fc2e9-9822-4d0a-b890-e85b9f232248']
sysinv 2022-04-06 18:56:55.735 114512 WARNING sysinv.conductor.manager [-] SYS_I Raise system config alarm: host controller-1 config applied: 3af91f4b-5ab5-452f-a8a0-402c882868e8 vs. target: baf91f4b-5ab5-452f-a8a0-402c882868e8.

The config target requires a lock/unlock because baf91f4b-5ab5-452f-a8a0-402c882868e8 is a 'reboot-required' config change.
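
In the log above, the target UUID (baf91f4b-...) and the applied UUID (3af91f4b-...) differ only in their leading hex digit, b versus 3. That is consistent with a reboot-required marker carried in the most significant bit of the 128-bit UUID value. The sketch below illustrates that reading. The flag position and the helper names are assumptions inferred from these two log values, not taken from the sysinv source.

```python
import uuid

# Assumption: the reboot-required marker is the top bit of the 128-bit UUID.
# Inferred from the two UUIDs in the log; not confirmed against sysinv code.
REBOOT_REQUIRED_FLAG = 1 << 127


def is_reboot_required(config_uuid: str) -> bool:
    """Return True if the (assumed) reboot-required bit is set."""
    return bool(uuid.UUID(config_uuid).int & REBOOT_REQUIRED_FLAG)


def clear_reboot_flag(config_uuid: str) -> str:
    """Return the same UUID with the (assumed) reboot-required bit cleared."""
    value = uuid.UUID(config_uuid).int & ~REBOOT_REQUIRED_FLAG
    return str(uuid.UUID(int=value))


# Values copied from the sysinv log above.
target = "baf91f4b-5ab5-452f-a8a0-402c882868e8"
applied = "3af91f4b-5ab5-452f-a8a0-402c882868e8"

print(is_reboot_required(target))            # flag set on the target
print(is_reboot_required(applied))           # flag not set on the applied config
print(clear_reboot_flag(target) == applied)  # same config apart from the flag
```

Under this reading, the host did receive the config content, but the target still carries the reboot-required marker, so the alarm stays raised until the host is locked and unlocked.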

Reproducibility
Seen once.

System Configuration
Distributed Cloud

Load info (eg: 2022-03-10_20-00-07)
stx6.0

Branch and the time when code was pulled or git commit or cengn load info

Last Pass
Upgrade worked many times before in this lab

Timestamp/Logs
The issue is triggered by subcloud add/delete operations during the upgrade.

After a 'system upgrade-start' operation, the database has already been snapshotted between the two controllers.

The upgrade-start operation was issued at 18:32:08.

The route configuration changes are due to manual dcmanager CLI commands that add/delete a subcloud during the upgrade:

2022-04-06T18:36:47.000 controller-0 -sh: info HISTORY: PID=2348352 UID=42425 dcmanager subcloud add --bootstrap-address 2620:10a:a001:d41::260 --bootstrap-values subcloud3001_ipv6-bootstrap-values.yaml --install-values subcloud3001-install-values.yaml --sysadmin-password xxxxxx --bmc-password xxxxxx
2022-04-06T18:37:00.000 controller-0 -sh: info HISTORY: PID=2348352 UID=42425 cd subcloud-3001
2022-04-06T18:37:03.000 controller-0 -sh: info HISTORY: PID=2348352 UID=42425 dcmanager subcloud add --bootstrap-address 2620:10a:a001:d41::260 --bootstrap-values subcloud3001_ipv6-bootstrap-values.yaml --install-values subcloud3001-install-values.yaml --sysadmin-password xxxxxx --bmc-password xxxxxx

This results in a 250.001 Config out of date alarm, which prevents the host-swact step of the upgrade from completing until the host-lock/unlock workaround is applied.
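
The triage step above (finding config-changing commands issued after upgrade-start) can be sketched as a small scan over the shell history lines. This is only an illustration: the function name, the list of command prefixes, and the assumption that upgrade-start happened on the same date as the history entries (2022-04-06) are mine, not part of any StarlingX tool.

```python
from datetime import datetime

# Hypothetical list of command prefixes treated as config-changing.
CONFIG_COMMANDS = ("dcmanager subcloud add", "dcmanager subcloud delete")

# History lines copied from the report above.
HISTORY = [
    "2022-04-06T18:36:47.000 controller-0 -sh: info HISTORY: PID=2348352 UID=42425 dcmanager subcloud add --bootstrap-address 2620:10a:a001:d41::260 --bootstrap-values subcloud3001_ipv6-bootstrap-values.yaml --install-values subcloud3001-install-values.yaml --sysadmin-password xxxxxx --bmc-password xxxxxx",
    "2022-04-06T18:37:00.000 controller-0 -sh: info HISTORY: PID=2348352 UID=42425 cd subcloud-3001",
    "2022-04-06T18:37:03.000 controller-0 -sh: info HISTORY: PID=2348352 UID=42425 dcmanager subcloud add --bootstrap-address 2620:10a:a001:d41::260 --bootstrap-values subcloud3001_ipv6-bootstrap-values.yaml --install-values subcloud3001-install-values.yaml --sysadmin-password xxxxxx --bmc-password xxxxxx",
]

# Assumption: upgrade-start (18:32:08) was issued on the same date.
UPGRADE_START = datetime(2022, 4, 6, 18, 32, 8)


def commands_after(history_lines, upgrade_start):
    """Return (timestamp, command) pairs for config-changing commands
    logged after upgrade_start."""
    hits = []
    for line in history_lines:
        stamp, _, rest = line.partition(" ")
        _, marker, tail = rest.partition("HISTORY: ")
        if not marker:
            continue  # not a shell history entry
        # tail looks like "PID=... UID=... <command>"
        parts = tail.split(" ", 2)
        if len(parts) < 3:
            continue
        command = parts[2]
        when = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%f")
        if when > upgrade_start and command.startswith(CONFIG_COMMANDS):
            hits.append((when, command))
    return hits


for when, command in commands_after(HISTORY, UPGRADE_START):
    print(when.isoformat(), command.split(" --")[0])
```

Both subcloud add commands fall a few minutes after upgrade-start, which matches the timing of the route config change seen in the sysinv log.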

Alarms
250.001 Config out of date Alarm

Test Activity
Developer Testing

Workaround
Lock and unlock the controller (host-lock/host-unlock) to clear the reboot-required config out of date alarm.

John Kung (john-kung)
summary: - Upgrade to stx6.0 failed due to manual route config during upgrade
+ Upgrade failed due to manual route config during upgrade
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/config/+/839202

Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/839202
Committed: https://opendev.org/starlingx/config/commit/e3686678caf0181f230c8ddc0ca9ce5abb916e81
Submitter: "Zuul (22348)"
Branch: master

commit e3686678caf0181f230c8ddc0ca9ce5abb916e81
Author: John Kung <email address hidden>
Date: Mon Apr 25 08:54:23 2022 -0400

    Disallow route config during upgrade states

    It is observed that a subcloud add operation following upgrade-start
    will result in a route config operation which will not be applied
    on the N+1 side.

    After an upgrade-start, which snapshots the config database,
    config operations could prevent an upgrade from being orchestrated.

    Route config operations are now disallowed until after the
    N+1 controller is upgraded and during upgrade abort (as the abort's
    not cancellable) as the load is being returned to the N load.

    After the N+1 controller is upgraded, allow route config operations as
    either the upgrade will complete and the system will be consistent;
    or the operation can be aborted,
    and the system restored to backedup state.

    Tox unit tests added to cover the route config during various
    upgrade states.

    Closes-Bug: 1970205

    Test Plan:
    PASSED route tox unit tests
    PASSED Disallow route config during Upgrade
    PASSED SystemController Duplex Upgrade
    PASSED Distributed Cloud Subcloud Upgrade
    PASSED Simplex Upgrade

    Signed-off-by: John Kung <email address hidden>
    Change-Id: If0cc68286ec87858b0517234e8bb08c4ed2ad851
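
The rule described in the commit message can be sketched as a simple state check. This is not the merged change: the state names, the exception class, and the function below are hypothetical and only illustrate the policy (block route config between upgrade-start and the N+1 controller upgrade, and during abort; allow it otherwise).

```python
from enum import Enum


class UpgradeState(Enum):
    # Hypothetical state names, for illustration only.
    NO_UPGRADE = "no-upgrade"
    STARTED = "started"
    N1_CONTROLLER_UPGRADED = "n+1-controller-upgraded"
    ABORTING = "aborting"


# States in which route config is rejected, per the commit message:
# after upgrade-start until the N+1 controller is upgraded, and during abort.
BLOCKED_STATES = frozenset({UpgradeState.STARTED, UpgradeState.ABORTING})


class RouteConfigNotAllowed(Exception):
    """Raised when a route config request arrives in a blocked state."""


def check_route_config_allowed(state: UpgradeState) -> None:
    """Raise RouteConfigNotAllowed if route config is blocked in this state."""
    if state in BLOCKED_STATES:
        raise RouteConfigNotAllowed(
            "route config is not allowed while upgrade state is '%s'" % state.value
        )


for state in UpgradeState:
    try:
        check_route_config_allowed(state)
        print(state.value, "-> allowed")
    except RouteConfigNotAllowed as exc:
        print(state.value, "-> rejected:", exc)
```

Rejecting the request up front avoids creating a config target that the N+1 controller cannot apply, which is what raised the alarm in this report.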

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → John Kung (john-kung)
tags: added: stx.7.0 stx.config
Changed in starlingx:
importance: Undecided → Medium