Aborting the upgrade for controller-0 in a duplex env fails

Bug #1929884 reported by Al Bailey
Affects: StarlingX
Status: Fix Released
Importance: Low
Assigned to: Al Bailey

Bug Description

Brief Description
-----------------
According to the documentation, when a software upgrade of an AIO-DX system is aborted after both controllers have been upgraded, it should still be possible to downgrade the controllers.

However, a file permission issue blocks the host-downgrade command for controller-0.

Note: this is a very low priority issue. An abort would normally never happen, and if it did, it would normally occur much earlier in the procedure. Also, very few people are doing upgrades.

A similar issue was encountered (and fixed) for another flag by this submission:
https://review.opendev.org/c/starlingx/config/+/675673

Severity
--------
Minor

Steps to Reproduce
------------------
Perform a platform upgrade of an AIO-DX system, up to the point where both controllers are unlocked.
Then do the following:
 - system upgrade-abort
 - (swact from controller-0 to controller-1 to make controller-1 active)
 - system host-lock controller-0
 - (wait for controller-0 to lock)
 - system host-downgrade controller-0

Expected Behavior
------------------
It should initiate the downgrade.

Actual Behavior
----------------
 system host-downgrade controller-0
[Errno 13] Permission denied: '/etc/platform/.upgrade_rollback'

Reproducibility
---------------
Reproducible

System Configuration
--------------------
AIO-DX (upgrading from custom r4 to custom r5)

Branch/Pull Time/Commit
-----------------------
Custom Load built May 25

Last Pass
---------
Unknown.

Timestamp/Logs
--------------
sysinv 2021-05-27 23:12:31.547 98746 ERROR wsme.api [-] Server-side error: "[Errno 13] Permission denied: '/etc/platform/.upgrade_rollback'". Detail:
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/wsmeext/pecan.py", line 85, in callfunction
    result = f(self, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/sysinv/api/controllers/v1/host.py", line 2773, in downgrade
    open(tsc.UPGRADE_ROLLBACK_FLAG, "w").close()
IOError: [Errno 13] Permission denied: '/etc/platform/.upgrade_rollback'
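
The failing call in the traceback is a plain touch-style open(path, "w").close(). The following is a minimal sketch, not the sysinv code, of how that pattern reports the OS error number. The helper name create_flag is hypothetical, and the demo writes into a temporary directory rather than /etc/platform.

```python
import errno
import os
import tempfile


def create_flag(path):
    """Create an empty flag file the way the traceback's open(..., "w") does.

    Returns None on success, or the errno reported by the OS on failure.
    create_flag is a hypothetical helper, not part of sysinv.
    """
    try:
        open(path, "w").close()
    except (IOError, OSError) as exc:
        return exc.errno
    return None


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    # Writable directory: the flag is created and None is returned.
    print(create_flag(os.path.join(workdir, ".upgrade_rollback")))
    # A process without write access to the target directory gets
    # errno.EACCES back instead, which is the error the sysinv API hit.
    print(create_flag("/etc/platform/.upgrade_rollback"))
```

A caller that sees errno.EACCES (13) knows the process lacks write access to the target location, which matches the "[Errno 13] Permission denied" reported above.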

Test Activity
-------------
Developer Testing

Workaround
----------
sudo touch /etc/platform/.upgrade_rollback
sudo chmod 666 /etc/platform/.upgrade_rollback

Al Bailey (albailey1974) wrote :

With the workaround, the code fails further on. I will address that as well.

system host-downgrade controller-0
Remote error: SysinvException Unable to shut down ceph storage monitor.
[u'Traceback (most recent call last):\n', u' File "/usr/lib64/python2.7/site-packages/sysinv/openstack/common/rpc/amqp.py", line 437, in _process_data\n **args)\n'
, u' File "/usr/lib64/python2.7/site-packages/sysinv/openstack/common/rpc/dispatcher.py", line 172, in dispatch\n result = getattr(proxyobj, method)(ctxt, **kwargs)\n'
, u' File "/usr/lib64/python2.7/site-packages/sysinv/conductor/manager.py", line 6486, in kill_ceph_storage_monitor\n _("Unable to shut down ceph storage monitor."))\n'
, u'SysinvException: Unable to shut down ceph storage monitor.\n'].

Changed in starlingx:
assignee: nobody → Al Bailey (albailey1974)
description: updated
Al Bailey (albailey1974) wrote :

called from:
  File "/usr/lib/python2.7/site-packages/wsmeext/pecan.py", line 85, in callfunction
    result = f(self, *args, **kwargs)

  File "/usr/lib64/python2.7/site-packages/sysinv/api/controllers/v1/host.py", line 2820, in downgrade
    pecan.request.context)

  File "/usr/lib64/python2.7/site-packages/sysinv/conductor/rpcapi.py", line 775, in kill_ceph_storage_monitor
    self.make_msg('kill_ceph_storage_monitor'))

OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/config/+/793616

Changed in starlingx:
status: New → In Progress
Al Bailey (albailey1974) wrote :

sysinv 2021-05-28 15:33:32.699 97830 ERROR sysinv.conductor.manager [-] Command '['mv', '/etc/pmon.d/ceph.conf', '/etc/pmond.ceph.conf.bak']' returned non-zero exit status 1: CalledProcessError: Command '['mv', '/etc/pmon.d/ceph.conf', '/etc/pmond.ceph.conf.bak']' returned non-zero exit status 1
2021-05-28 15:33:32.699 97830 ERROR sysinv.conductor.manager Traceback (most recent call last):
2021-05-28 15:33:32.699 97830 ERROR sysinv.conductor.manager File "/usr/lib64/python2.7/site-packages/sysinv/conductor/manager.py", line 6472, in kill_ceph_storage_monitor
2021-05-28 15:33:32.699 97830 ERROR sysinv.conductor.manager stdout=fnull, stderr=fnull)
2021-05-28 15:33:32.699 97830 ERROR sysinv.conductor.manager File "/usr/lib64/python2.7/subprocess.py", line 542, in check_call
2021-05-28 15:33:32.699 97830 ERROR sysinv.conductor.manager raise CalledProcessError(retcode, cmd)
2021-05-28 15:33:32.699 97830 ERROR sysinv.conductor.manager CalledProcessError: Command '['mv', '/etc/pmon.d/ceph.conf', '/etc/pmond.ceph.conf.bak']' returned non-zero exit status 1

Al Bailey (albailey1974) wrote :

There is no /etc/pmon.d/ceph.conf file in this env.

Al Bailey (albailey1974) wrote :

According to this manifest, ceph is not managed by pmon on a DX system:
https://github.com/starlingx/stx-puppet/blob/master/puppet-manifests/src/modules/platform/manifests/ceph.pp#L851

so that step needs to be revisited.

Changed in starlingx:
importance: Undecided → Low
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/config/+/793647

OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/793616
Committed: https://opendev.org/starlingx/config/commit/0dc99eee608336fe01b58821ea404286371f1408
Submitter: "Zuul (22348)"
Branch: master

commit 0dc99eee608336fe01b58821ea404286371f1408
Author: albailey <email address hidden>
Date: Fri May 28 11:05:43 2021 -0500

    Fix file permissions failure during duplex upgrade abort

    When issuing a downgrade for controller-0 in a duplex upgrade
    abort and rollback scenario, the downgrade command was failing
    because the sysinv API does not have root permissions to set
    a file flag.
    The fix is to use RPC so the conductor can create the flag
    and allow the downgrade for controller-0 to get further.

    Partial-Bug: 1929884
    Signed-off-by: albailey <email address hidden>
    Change-Id: I913bcad73309fe887a12cbb016a518da93327947
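
The delegation described in this commit message can be sketched as follows. This is an illustration under stated assumptions, not the merged change: ConductorStub, ApiHandler, and create_rollback_flag are hypothetical names, and a direct method call stands in for the RPC round trip between the API and the conductor.

```python
import os
import tempfile


class ConductorStub(object):
    """Stands in for the privileged conductor process (hypothetical)."""

    def __init__(self, flag_path):
        self.flag_path = flag_path

    def create_rollback_flag(self):
        # In the real system this would run with the conductor's privileges.
        open(self.flag_path, "w").close()
        return True


class ApiHandler(object):
    """Stands in for the unprivileged API layer (hypothetical)."""

    def __init__(self, conductor):
        self.conductor = conductor

    def downgrade(self, hostname):
        # The API never touches the filesystem itself; it asks the
        # conductor to create the flag. A direct call replaces RPC here.
        self.conductor.create_rollback_flag()
        return "downgrade requested for %s" % hostname


if __name__ == "__main__":
    flag = os.path.join(tempfile.mkdtemp(), ".upgrade_rollback")
    api = ApiHandler(ConductorStub(flag))
    print(api.downgrade("controller-0"))
    print(os.path.exists(flag))
```

The point of the design is that only the component that already has the required permissions writes the flag, so the API process does not need its own write access to the flag's directory.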

OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/c/starlingx/config/+/793647
Committed: https://opendev.org/starlingx/config/commit/31c77439d2cea590dfcca13cfa646522665f8686
Submitter: "Zuul (22348)"
Branch: master

commit 31c77439d2cea590dfcca13cfa646522665f8686
Author: albailey <email address hidden>
Date: Fri May 28 13:42:42 2021 -0500

    Fix controller-0 downgrade failing to kill ceph

    kill_ceph_storage_monitor tried to manipulate a pmon
    file that does not exist in an AIO-DX environment.

    We no longer invoke kill_ceph_storage_monitor in an
    AIO SX or DX env.

    This allows: "system host-downgrade controller-0"
    to proceed in an AIO-DX environment where that second
    controller (controller-0) was upgraded.

    Partial-Bug: 1929884
    Signed-off-by: albailey <email address hidden>
    Change-Id: I633853f75317736084feae96b5b849c601204c13
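
The behaviour this commit message describes, skipping the monitor shutdown on AIO systems, can be sketched as below. The deployment labels, needs_ceph_monitor_kill, and downgrade_steps are hypothetical names chosen for illustration; the report does not show how sysinv actually identifies an AIO system, and "downgrade" is only a placeholder for the remaining work.

```python
# Hypothetical deployment labels, used only for this sketch.
AIO_SIMPLEX = "aio-sx"
AIO_DUPLEX = "aio-dx"
STANDARD = "standard"


def needs_ceph_monitor_kill(deployment):
    """Return False for AIO SX/DX, where the pmon ceph file is absent."""
    return deployment not in (AIO_SIMPLEX, AIO_DUPLEX)


def downgrade_steps(deployment):
    """List the steps a downgrade would take for the given deployment."""
    steps = []
    if needs_ceph_monitor_kill(deployment):
        steps.append("kill_ceph_storage_monitor")
    # Placeholder for the rest of the downgrade handling.
    steps.append("downgrade")
    return steps


if __name__ == "__main__":
    for deployment in (AIO_SIMPLEX, AIO_DUPLEX, STANDARD):
        print(deployment, downgrade_steps(deployment))
```

Deciding up front whether the step applies avoids running an mv on a file that the environment never creates.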

Al Bailey (albailey1974)
Changed in starlingx:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/794611

OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/794906

OpenStack Infra (hudson-openstack) wrote : Change abandoned on config (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/794611

OpenStack Infra (hudson-openstack) wrote : Fix merged to config (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/config/+/794906
Committed: https://opendev.org/starlingx/config/commit/75758b37a5a23c8811355b67e2a430a1713cd85b
Submitter: "Zuul (22348)"
Branch: f/centos8

commit 9e420d9513e5fafb1df4d29567bc299a9e04d58d
Author: Bin Qian <email address hidden>
Date: Mon May 31 14:45:52 2021 -0400

    Add more logging to run docker login

    Add error log for running docker login. The new log could
    help identify docker login failure.

    Closes-Bug: 1930310
    Change-Id: I8a709fb6665de8301fbe3022563499a92b2a0211
    Signed-off-by: Bin Qian <email address hidden>

commit 31c77439d2cea590dfcca13cfa646522665f8686
Author: albailey <email address hidden>
Date: Fri May 28 13:42:42 2021 -0500

    Fix controller-0 downgrade failing to kill ceph

    kill_ceph_storage_monitor tried to manipulate a pmon
    file that does not exist in an AIO-DX environment.

    We no longer invoke kill_ceph_storage_monitor in an
    AIO SX or DX env.

    This allows: "system host-downgrade controller-0"
    to proceed in an AIO-DX environment where that second
    controller (controller-0) was upgraded.

    Partial-Bug: 1929884
    Signed-off-by: albailey <email address hidden>
    Change-Id: I633853f75317736084feae96b5b849c601204c13

commit 0dc99eee608336fe01b58821ea404286371f1408
Author: albailey <email address hidden>
Date: Fri May 28 11:05:43 2021 -0500

    Fix file permissions failure during duplex upgrade abort

    When issuing a downgrade for controller-0 in a duplex upgrade
    abort and rollback scenario, the downgrade command was failing
    because the sysinv API does not have root permissions to set
    a file flag.
    The fix is to use RPC so the conductor can create the flag
    and allow the downgrade for controller-0 to get further.

    Partial-Bug: 1929884
    Signed-off-by: albailey <email address hidden>
    Change-Id: I913bcad73309fe887a12cbb016a518da93327947

commit 7ef3724dad173754e40b45538b1cc726a458cc1c
Author: Chen, Haochuan Z <email address hidden>
Date: Tue May 25 16:16:29 2021 +0800

    Fix bug rook-ceph provision with multi osd on one host

    Test case:
    1, deploy simplex system
    2, apply rook-ceph with below override value
    value.yaml
    cluster:
      storage:
        nodes:
        - name: controller-0
          devices:
          - name: sdb
          - name: sdc
    3, reboot

    Without this fix, only osd pod could launch successfully after boot
    as vg start with ceph could not correctly add in sysinv-database

    Closes-bug: 1929511

    Change-Id: Ia5be599cd168d13d2aab7b5e5890376c3c8a0019
    Signed-off-by: Chen, Haochuan Z <email address hidden>

commit 23505ba77d76114cf8a0bf833f9a5bcd05bc1dd1
Author: Angie Wang <email address hidden>
Date: Tue May 25 18:49:21 2021 -0400

    Fix issue in partition data migration script

    The created partition dictonary partition_map is not
    an ordered dict so we need to sort it by its key -
    device node when iterating it to adjust the device
    nodes/paths for user created extra partitions to ensure
    the number of device node...

tags: added: in-f-centos8
Ghada Khalil (gkhalil)
Changed in starlingx:
status: Fix Committed → Fix Released
tags: added: stx.config