Some storage nodes stay unlocked and cannot be locked/force-locked after controller-0/system restore

Bug #1798197 reported by mhg
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Wei Zhou

Bug Description

Brief Description
-----------------
After controller-0 was powered on and 'config_controller --restore-system <bk_system...>' was run on it, some of the storage nodes were in the 'unlocked | enabled | degraded' state.

Attempts to lock them failed with an error:
[wrsroot@controller-0 ~(keystone_admin)]$ system host-lock storage-1
Cannot lock a storage node when ceph pool usage is undetermined.
[wrsroot@controller-0 ~(keystone_admin)]$ echo $?
1
Force-lock also failed:
[wrsroot@controller-0 ~(keystone_admin)]$ system host-lock -f storage-1
Cannot lock a storage node when ceph pool usage is undetermined.

Severity
--------
Major

Steps to Reproduce
------------------
1. Back up the system and save the backup files.
2. Install controller-0 with the same load (power off all other nodes first).
3. Run 'sudo config_controller --restore-system <system_backup_file>'.
4. Run 'source /etc/nova/openrc'.
5. Lock any nodes (except controller-0) that are unlocked; use 'force-lock' if 'lock' does not work (a scripted version of this step is sketched after this list).
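
For step 5, the loop below is a minimal illustrative sketch, not part of the original report: it assumes Python 3.7+, that the 'system' CLI is available with admin credentials already sourced (step 4), and that the host names in HOSTS are placeholders for the lab's actual nodes.

    #!/usr/bin/env python3
    # Illustrative helper for step 5: try 'system host-lock' on each node and
    # fall back to 'system host-lock -f' when the plain lock is rejected.
    # Run after 'source /etc/nova/openrc'; host names below are examples only.
    import subprocess

    HOSTS = ["controller-1", "storage-0", "storage-1", "compute-0"]  # example list

    def lock_host(host):
        """Return True if the host was locked, trying 'lock' then 'force-lock'."""
        for cmd in (["system", "host-lock", host],
                    ["system", "host-lock", "-f", host]):
            result = subprocess.run(cmd, capture_output=True, text=True)
            if result.returncode == 0:
                print("%s locked with: %s" % (host, " ".join(cmd)))
                return True
            print("%s failed: %s" % (" ".join(cmd),
                                     (result.stdout or result.stderr).strip()))
        return False

    for host in HOSTS:
        lock_host(host)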

Expected Behavior
------------------
All nodes (other than controller-0) should be lockable, either with 'lock' or, if that fails, with 'force-lock'.

Actual Behavior
----------------
For the unlocked storage nodes, 'system host-lock' failed with an error message, e.g. for storage-1:
[wrsroot@controller-0 ~(keystone_admin)]$ system host-lock storage-1
Cannot lock a storage node when ceph pool usage is undetermined.

Force-lock also failed:
[wrsroot@controller-0 ~(keystone_admin)]$ system host-lock -f storage-1
Cannot lock a storage node when ceph pool usage is undetermined.

Reproducibility
---------------
Reproducible on pv0

System Configuration
--------------------
Dedicated storage 2 + 6 + 4

Branch/Pull Time/Commit
-----------------------
StarlingX_18.10 as of 2018-10-12_01-52-00

Timestamp/Logs
--------------
20181016 16:56:05

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Targeting stx.2019.03 as this is specific to a 6-storage node config which is not very common. It was confirmed that this is not an issue for 2-storage node configs.

Changed in starlingx:
assignee: nobody → Wei Zhou (wzhou007)
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.2019.03 stx.config
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-config (master)

Fix proposed to branch: master
Review: https://review.openstack.org/612422

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-config (master)

Reviewed: https://review.openstack.org/612422
Committed: https://git.openstack.org/cgit/openstack/stx-config/commit/?id=5e2772b25e92730e9be990f57e9aab57c3412135
Submitter: Zuul
Branch: master

commit 5e2772b25e92730e9be990f57e9aab57c3412135
Author: Wei Zhou <email address hidden>
Date: Mon Oct 22 10:55:32 2018 -0400

    Lock storage nodes during system restore

    Currently, as part of a system restore, maintenance will lock all the
    nodes except controller-0. When locking a storage node, we query the
    ceph pools to check whether they are empty, in case replication would
    be lost. But during a restore the ceph cluster is not up yet, so the
    ceph pool query fails and blocks the locking of storage nodes.

    Solution: in the restore case, if it is a force-lock, skip the ceph
    pool query and just lock the storage node.

    Change-Id: I0501b44ebe635f36d7401437bfe289de7ee9fc73
    Closes-Bug: 1798197
    Signed-off-by: Wei Zhou <email address hidden>
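
A minimal, hypothetical sketch of the control flow described in the commit message above, for readers unfamiliar with the check it refers to. This is not the actual sysinv/stx-config code; helper names such as ceph_pools_empty() and restore_in_progress() are invented for illustration, under the assumption stated in the commit that the ceph cluster cannot be queried during a restore.

    # Hypothetical sketch of the storage-node lock pre-check described above.
    # Not the real sysinv code; all helper names are illustrative.

    class CephQueryError(Exception):
        """Raised when the ceph cluster cannot be queried (e.g. not up yet)."""

    def ceph_pools_empty():
        # Placeholder: the real system queries ceph pool usage here.
        # During a restore the cluster is not up, so the query fails.
        raise CephQueryError("ceph pool usage is undetermined")

    def restore_in_progress():
        # Placeholder: the real system checks whether a system restore
        # ('config_controller --restore-system') is underway.
        return True

    def storage_lock_allowed(force=False):
        """Return True if a storage host may be locked.

        Before the fix the ceph pool query ran unconditionally, so a failed
        query blocked both lock and force-lock. The fix skips the query for
        a force-lock issued while a restore is in progress.
        """
        if force and restore_in_progress():
            return True  # restore case: bypass the pool-usage check
        try:
            return ceph_pools_empty()
        except CephQueryError:
            print("Cannot lock a storage node when ceph pool usage is undetermined.")
            return False

    if __name__ == "__main__":
        print(storage_lock_allowed(force=False))  # rejected: ceph query fails
        print(storage_lock_allowed(force=True))   # allowed during restore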

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
mhg (marvinhg) wrote :

A retest on a 6-storage-node lab reproduced the issue:

[2018-11-12 20:04:04,263] 262 DEBUG MainThread ssh.send :: Send 'system host-lock storage-2'
[2018-11-12 20:04:06,451] 382 DEBUG MainThread ssh.expect :: Output:
Cannot lock a storage node when ceph pool usage is undetermined.
[wrsroot@controller-0 ~(keystone_admin)]$
...
[2018-11-12 20:04:08,644] 262 DEBUG MainThread ssh.send :: Send 'system host-lock storage-5'
[2018-11-12 20:04:10,850] 382 DEBUG MainThread ssh.expect :: Output:
Cannot lock a storage node when ceph pool usage is undetermined.
[wrsroot@controller-0 ~(keystone_admin)]$

Load retested with:
###
### StarlingX
### Release 18.10
###

SW_VERSION="18.10"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="2018-11-11_20-18-00"
SRC_BUILD_ID="134"

JOB="StarlingX_Upstream_build"
BUILD_BY="jenkins"
BUILD_NUMBER="134"
BUILD_HOST="yow-cgts1-lx"
BUILD_DATE="2018-11-11 20:19:24 -0500"

Note:
- Difference from the run in which the issue was originally found: in this run, only 2 storage nodes were 'unlocked' right after 'config_controller --restore-system', while originally more nodes were in the 'unlocked' state.

Changed in starlingx:
status: Fix Released → In Progress
Revision history for this message
Wei Zhou (wzhou007) wrote :
Revision history for this message
Ghada Khalil (gkhalil) wrote :

The issue reported in #1801772 is addressed by:
https://review.openstack.org/618127

Marking as Fix Released

Changed in starlingx:
status: In Progress → Fix Released
Ken Young (kenyis)
tags: added: stx.2019.05
removed: stx.2019.03
Ken Young (kenyis)
tags: added: stx.2.0
removed: stx.2019.05