After controller lock and unlock, the controller goes to a failed state
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
StarlingX | Fix Released | Medium | Heitor Matsui |
Bug Description
Brief Description
-----------------
After a host lock and unlock, controller-0 goes to a failed state.
Severity
--------
Critical: unable to use subclouds
Steps to Reproduce
------------------
1. Install a DC lab with 22.06 and the upgrade patches.
2. Follow the DC upgrade steps and upgrade the central cloud and the subcloud.
3. Lock and unlock controller-0 and verify the host state.
Expected Behavior
------------------
The host state shows no failure.
Actual Behavior
----------------
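system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname     | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1  | controller-0 | controller  | unlocked       | disabled    | failed       |
| 2  | controller-1 | controller  | unlocked       | enabled     | available    |
+----+--------------+-------------+----------------+-------------+--------------+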
Reproducibility
---------------
Intermittent
System Configuration
--------------------
DC
Branch/Pull Time/Commit
-----------------------
2022-11-30_22-00-06
Last Pass
---------
N/A
Timestamp/Logs
--------------
puppet.log:
Error: 2022-12-03 03:45:22 +0000 Evaluation Error: The title '/dev/disk/
Test Activity
-------------
Regression test
Workaround
----------
Manually delete the duplicated PV records that are not in use from the database, then lock and unlock the host.
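The selection step of this workaround can be illustrated with a small Python sketch that picks out candidate records before anything is deleted. It assumes, hypothetically, that each PV record is a dict with `uuid`, `device_path`, `device_node` and `in_use` keys, and that duplicates share the same `device_path` (the field the fix below keys on); real sysinv rows use different field names, and the device paths are made up.

```python
from collections import defaultdict


def find_unused_duplicates(pvs):
    """Return PV records that share a device_path with another record
    and are not in use: candidates for manual deletion.

    Each record is a hypothetical dict with "uuid", "device_path",
    "device_node" and "in_use" keys (not sysinv's real schema).
    """
    by_path = defaultdict(list)
    for pv in pvs:
        by_path[pv["device_path"]].append(pv)

    candidates = []
    for group in by_path.values():
        if len(group) < 2:
            continue  # only one record for this device_path: nothing duplicated
        if not any(pv["in_use"] for pv in group):
            continue  # no record in use: ambiguous, leave for manual review
        # Keep the in-use record(s); report only the unused extras.
        candidates.extend(pv for pv in group if not pv["in_use"])
    return candidates


# Illustrative data: "b" duplicates "a" after the device node changed.
path_a = "/dev/disk/by-path/pci-0000:00:1f.2-ata-1.0"
path_c = "/dev/disk/by-path/pci-0000:00:1f.2-ata-2.0"
pvs = [
    {"uuid": "a", "device_path": path_a, "device_node": "/dev/sda", "in_use": True},
    {"uuid": "b", "device_path": path_a, "device_node": "/dev/sdb", "in_use": False},
    {"uuid": "c", "device_path": path_c, "device_node": "/dev/sdc", "in_use": True},
]
print([pv["uuid"] for pv in find_unused_duplicates(pvs)])  # prints ['b']
```

Only the returned records would then be removed before the lock/unlock step. How "in use" is determined on a real system is not specified in this report, so the `in_use` flag here is a stand-in.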
Changed in starlingx:
status: New → In Progress
Changed in starlingx:
assignee: nobody → Heitor Matsui (heitormatsui)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.8.0 stx.config
Reviewed: https://review.opendev.org/c/starlingx/config/+/867525
Committed: https://opendev.org/starlingx/config/commit/ca6bc490b7a7df646827e80395e4a5f0b71d6301
Submitter: "Zuul (22348)"
Branch: master
commit ca6bc490b7a7df646827e80395e4a5f0b71d6301
Author: Heitor Matsui <email address hidden>
Date: Tue Dec 13 12:34:39 2022 -0300
Use device_path to determine if PV is found
After upgrading a host, we observed a rare disk enumeration
issue that would later duplicate PVs: the host reports a
different device_node, which therefore does not match
condition [1].
This occurs when the same persistent device name, used by both
CentOS and Debian, points to different kernel-derived device
nodes. This is a unique scenario not previously handled by the
conductor logic, and it arises from the much later version of
systemd/udev used in Debian vs. CentOS.
This commit adds logic to fetch device_path earlier and then
use it to determine whether the PV is found on the upgraded
system, so that the PV is updated instead of being created again.
[1] https://opendev.org/starlingx/config/src/commit/748afd7f5b7d3fc5e958f7173ff1a19c946c73b4/sysinv/sysinv/sysinv/sysinv/conductor/manager.py#L5052
Test Plan
PASS: fresh install/bootstrap/unlock
PASS: host lock/unlock
PASS: upgrade AIO-DX
PASS: force the enumeration issue via the database and observe that
the existing PV is updated instead of duplicated after the agent
reports back to the conductor
Closes-bug: 1999679
Change-Id: I43ae44f088c84b45a7a23c46d1ffca4568673e39
Signed-off-by: Heitor Matsui <email address hidden>
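The approach described in the commit message, matching an existing PV by its persistent `device_path` instead of its kernel-assigned `device_node`, can be sketched in Python. This is a simplified illustration, not the sysinv conductor code: the `PV` dataclass, `find_pv` and `reconcile_pv` are hypothetical names, the device paths are made up, and the sketch assumes the reported `device_path` is already available when the lookup runs.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PV:
    # Hypothetical, simplified PV record; not sysinv's actual model.
    uuid: str
    device_path: str  # persistent name, e.g. a /dev/disk/by-path link
    device_node: str  # kernel-assigned node such as /dev/sda


def find_pv(existing: List[PV], device_path: str) -> Optional[PV]:
    """Look up an existing PV by its persistent device_path rather than
    its device_node, which may change across an upgrade."""
    for pv in existing:
        if pv.device_path == device_path:
            return pv
    return None


def reconcile_pv(existing: List[PV], reported: PV) -> str:
    """Update the matching PV in place if found; otherwise add the
    reported one. Returns "updated" or "created"."""
    match = find_pv(existing, reported.device_path)
    if match is not None:
        match.device_node = reported.device_node  # refresh the renamed node
        return "updated"
    existing.append(reported)
    return "created"


# Same persistent path, but the kernel now enumerates the disk as /dev/sdb.
existing = [PV("pv-1", "/dev/disk/by-path/pci-0000:00:1f.2-ata-1.0", "/dev/sda")]
reported = PV("pv-2", "/dev/disk/by-path/pci-0000:00:1f.2-ata-1.0", "/dev/sdb")
print(reconcile_pv(existing, reported), len(existing))  # prints: updated 1
```

Matching on `device_node` would have missed `pv-1` in this scenario, because the node changed from `/dev/sda` to `/dev/sdb`, so a second record would have been created: the duplication described in this bug.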