Some storage nodes stay unlocked and cannot be locked/force-locked after controller-0/system restore

Bug #1798197 reported by mhg
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Wei Zhou

Bug Description

Brief Description
-----------------
After controller-0 was powered on and 'config_controller --restore-system <bk_system...>' was run on it, some of the storage nodes were in the 'unlocked | enabled | degraded' state.

Attempts to lock them failed with an error:
[wrsroot@controller-0 ~(keystone_admin)]$ system host-lock storage-1
Cannot lock a storage node when ceph pool usage is undetermined.
[wrsroot@controller-0 ~(keystone_admin)]$ echo $?
1
Force-lock also failed:
[wrsroot@controller-0 ~(keystone_admin)]$ system host-lock -f storage-1
Cannot lock a storage node when ceph pool usage is undetermined.

Severity
--------
Major

Steps to Reproduce
------------------
1. Back up the system and save the backup files.
2. Install controller-0 with the same load (power off all other nodes first).
3. Run 'sudo config_controller --restore-system <system_backup_file>'.
4. Run 'source /etc/nova/openrc'.
5. Lock any nodes (except controller-0) that are unlocked; use 'force-lock' if 'lock' does not work (a scripted version of this step is sketched after this list).
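
For step 5, the loop below is a minimal illustrative sketch, not part of the original report: it assumes Python 3.7+, that the 'system' CLI is available with admin credentials already sourced (step 4), and that the host names in HOSTS are placeholders for the lab's actual nodes.

    #!/usr/bin/env python3
    # Illustrative helper for step 5: try 'system host-lock' on each node and
    # fall back to 'system host-lock -f' when the plain lock is rejected.
    # Run after 'source /etc/nova/openrc'; host names below are examples only.
    import subprocess

    HOSTS = ["controller-1", "storage-0", "storage-1", "compute-0"]  # example list

    def lock_host(host):
        """Return True if the host was locked, trying 'lock' then 'force-lock'."""
        for cmd in (["system", "host-lock", host],
                    ["system", "host-lock", "-f", host]):
            result = subprocess.run(cmd, capture_output=True, text=True)
            if result.returncode == 0:
                print("%s locked with: %s" % (host, " ".join(cmd)))
                return True
            print("%s failed: %s" % (" ".join(cmd),
                                     (result.stdout or result.stderr).strip()))
        return False

    for host in HOSTS:
        lock_host(host)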

Expected Behavior
------------------
All nodes (other than controller-0) should be lockable, either with 'lock' or, if that fails, with 'force-lock'.

Actual Behavior
----------------
For the unlocked storage nodes, 'system host-lock' failed with an error message, e.g. for storage-1:
[wrsroot@controller-0 ~(keystone_admin)]$ system host-lock storage-1
Cannot lock a storage node when ceph pool usage is undetermined.

Force-lock also failed:
[wrsroot@controller-0 ~(keystone_admin)]$ system host-lock -f storage-1
Cannot lock a storage node when ceph pool usage is undetermined.

Reproducibility
---------------
Reproducible on pv0

System Configuration
--------------------
Dedicated storage 2 + 6 + 4

Branch/Pull Time/Commit
-----------------------
StarlingX_18.10 as of 2018-10-12_01-52-00

Timestamp/Logs
--------------
20181016 16:56:05

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Targeting stx.2019.03 as this is specific to a 6-storage node config which is not very common. It was confirmed that this is not an issue for 2-storage node configs.

Changed in starlingx:
assignee: nobody → Wei Zhou (wzhou007)
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.2019.03 stx.config
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-config (master)

Fix proposed to branch: master
Review: https://review.openstack.org/612422

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-config (master)

Reviewed: https://review.openstack.org/612422
Committed: https://git.openstack.org/cgit/openstack/stx-config/commit/?id=5e2772b25e92730e9be990f57e9aab57c3412135
Submitter: Zuul
Branch: master

commit 5e2772b25e92730e9be990f57e9aab57c3412135
Author: Wei Zhou <email address hidden>
Date: Mon Oct 22 10:55:32 2018 -0400

    Lock storage nodes during system restore

    Currently, as part of a system restore, maintenance will lock all the
    nodes except controller-0. When locking a storage node, we query the
    ceph pools to check whether they are empty, in case replication would
    be lost. But during a restore the ceph cluster is not up yet, so the
    ceph pool query fails and blocks the locking of storage nodes.

    Solution: in the restore case, if it is a force-lock, skip the ceph
    pool query and just lock the storage node.

    Change-Id: I0501b44ebe635f36d7401437bfe289de7ee9fc73
    Closes-Bug: 1798197
    Signed-off-by: Wei Zhou <email address hidden>
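
A minimal, hypothetical sketch of the control flow described in the commit message above, for readers unfamiliar with the check it refers to. This is not the actual sysinv/stx-config code; helper names such as ceph_pools_empty() and restore_in_progress() are invented for illustration, under the assumption stated in the commit that the ceph cluster cannot be queried during a restore.

    # Hypothetical sketch of the storage-node lock pre-check described above.
    # Not the real sysinv code; all helper names are illustrative.

    class CephQueryError(Exception):
        """Raised when the ceph cluster cannot be queried (e.g. not up yet)."""

    def ceph_pools_empty():
        # Placeholder: the real system queries ceph pool usage here.
        # During a restore the cluster is not up, so the query fails.
        raise CephQueryError("ceph pool usage is undetermined")

    def restore_in_progress():
        # Placeholder: the real system checks whether a system restore
        # ('config_controller --restore-system') is underway.
        return True

    def storage_lock_allowed(force=False):
        """Return True if a storage host may be locked.

        Before the fix the ceph pool query ran unconditionally, so a failed
        query blocked both lock and force-lock. The fix skips the query for
        a force-lock issued while a restore is in progress.
        """
        if force and restore_in_progress():
            return True  # restore case: bypass the pool-usage check
        try:
            return ceph_pools_empty()
        except CephQueryError:
            print("Cannot lock a storage node when ceph pool usage is undetermined.")
            return False

    if __name__ == "__main__":
        print(storage_lock_allowed(force=False))  # rejected: ceph query fails
        print(storage_lock_allowed(force=True))   # allowed during restore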

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
mhg (marvinhg) wrote :

A retest on a 6-storage-node lab reproduced the issue:

[2018-11-12 20:04:04,263] 262 DEBUG MainThread ssh.send :: Send 'system host-lock storage-2'
[2018-11-12 20:04:06,451] 382 DEBUG MainThread ssh.expect :: Output:
Cannot lock a storage node when ceph pool usage is undetermined.
[wrsroot@controller-0 ~(keystone_admin)]$
...
[2018-11-12 20:04:08,644] 262 DEBUG MainThread ssh.send :: Send 'system host-lock storage-5'
[2018-11-12 20:04:10,850] 382 DEBUG MainThread ssh.expect :: Output:
Cannot lock a storage node when ceph pool usage is undetermined.
[wrsroot@controller-0 ~(keystone_admin)]$

Load retested with:
###
### StarlingX
### Release 18.10
###

SW_VERSION="18.10"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="2018-11-11_20-18-00"
SRC_BUILD_ID="134"

JOB="StarlingX_Upstream_build"
BUILD_BY="jenkins"
BUILD_NUMBER="134"
BUILD_HOST="yow-cgts1-lx"
BUILD_DATE="2018-11-11 20:19:24 -0500"

Note:
- Difference from the run in which the issue was originally found: in this run, only 2 storage nodes were 'unlocked' right after 'config_controller --restore-system', while originally more nodes were in the 'unlocked' state.

Changed in starlingx:
status: Fix Released → In Progress
Revision history for this message
Wei Zhou (wzhou007) wrote :
Revision history for this message
Ghada Khalil (gkhalil) wrote :

The issue reported in #1801772 is addressed by:
https://review.openstack.org/618127

Marking as Fix Released

Changed in starlingx:
status: In Progress → Fix Released
Ken Young (kenyis)
tags: added: stx.2019.05
removed: stx.2019.03
Ken Young (kenyis)
tags: added: stx.2.0
removed: stx.2019.05