After Controller lock and unlock, Controller goes to failed state

Bug #1999679 reported by Heitor Matsui
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Heitor Matsui

Bug Description

Brief Description
-----------------
After a host lock and unlock on controller-0, the host goes to a failed state.

Severity
--------
Critical: unable to use subclouds

Steps to Reproduce
------------------
1. Install a DC lab with 22.06 and the upgrade patches
2. Follow the DC upgrade steps and upgrade the central cloud and subcloud
3. Lock and unlock controller-0 and verify the state of the host

Expected Behavior
------------------
No failure; the host returns to the unlocked/enabled/available state after the unlock.

Actual Behavior
----------------
 system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname     | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1  | controller-0 | controller  | unlocked       | disabled    | failed       |
| 2  | controller-1 | controller  | unlocked       | enabled     | available    |
+----+--------------+-------------+----------------+-------------+--------------+

Reproducibility
---------------
Intermittent

System Configuration
--------------------
DC

Branch/Pull Time/Commit
-----------------------
2022-11-30_22-00-06

Last Pass
---------
N/A

Timestamp/Logs
--------------
puppet.log:
Error: 2022-12-03 03:45:22 +0000 Evaluation Error: The title '/dev/disk/by-path/pci-0000:00:1f.2-ata-5.0-part4' has already been used in this resource expression (file: /usr/share/puppet/modules/platform/manifests/lvm.pp, line: 57, column: 3) on node controller-0
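
The duplicated PV records explain this failure: when the manifest is generated, two records that share the same persistent by-path name produce two resources with the same title, which Puppet rejects. A minimal illustration, with hypothetical device nodes:

    # Hypothetical illustration: two PV records that point at the same
    # persistent by-path name but different kernel device nodes would
    # both render into lvm.pp under the same resource title.
    pvs = [
        {"device_node": "/dev/sda4",  # hypothetical pre-upgrade node
         "device_path": "/dev/disk/by-path/pci-0000:00:1f.2-ata-5.0-part4"},
        {"device_node": "/dev/sdc4",  # hypothetical post-upgrade node
         "device_path": "/dev/disk/by-path/pci-0000:00:1f.2-ata-5.0-part4"},
    ]
    titles = [pv["device_path"] for pv in pvs]
    assert len(titles) != len(set(titles))  # duplicate title -> Puppet error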

Test Activity
-------------
Regression test

Workaround
----------
Manually delete the unused duplicate PV records from the database, then lock/unlock the host.
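
A sketch of what that cleanup might look like, assuming the sysinv PostgreSQL database and an i_pv table; the database, table, and column names here are assumptions, so inspect the actual schema and records before deleting anything:

    # Hypothetical sketch only: database, table and column names are
    # assumptions and must be verified against the running system.
    import psycopg2

    conn = psycopg2.connect(dbname="sysinv", user="postgres")
    with conn, conn.cursor() as cur:
        # List the PV records so the stale duplicate (same device path,
        # unused device node) can be identified by its id.
        cur.execute("SELECT id, lvm_pv_name, disk_or_part_device_path FROM i_pv")
        for row in cur.fetchall():
            print(row)
        # After confirming which record is the unused duplicate:
        # cur.execute("DELETE FROM i_pv WHERE id = %s", (stale_pv_id,))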

Changed in starlingx:
status: New → In Progress
Changed in starlingx:
assignee: nobody → Heitor Matsui (heitormatsui)
OpenStack Infra (hudson-openstack) wrote: Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/867525
Committed: https://opendev.org/starlingx/config/commit/ca6bc490b7a7df646827e80395e4a5f0b71d6301
Submitter: "Zuul (22348)"
Branch: master

commit ca6bc490b7a7df646827e80395e4a5f0b71d6301
Author: Heitor Matsui <email address hidden>
Date: Tue Dec 13 12:34:39 2022 -0300

    Use device_path to determine if PV is found

    After upgrading a host we observed a rare disk enumeration
    issue that would later duplicate PVs, since the host reports
    a different device_node and thus does not match condition [1].

    This occurs when the same persistent device name points to
    different kernel-derived device nodes on CentOS and Debian.
    This scenario was not previously handled by the conductor
    logic and stems from the much later version of systemd/udev
    used in Debian compared to CentOS.

    This commit adds logic to fetch the device_path earlier and
    use it to determine whether the PV already exists on the
    upgraded system, so the PV is updated instead of created again.

    [1] https://opendev.org/starlingx/config/src/commit/748afd7f5b7d3fc5e958f7173ff1a19c946c73b4/sysinv/sysinv/sysinv/sysinv/conductor/manager.py#L5052

    Test Plan
    PASS: fresh install/bootstrap/unlock
    PASS: host lock/unlock
    PASS: upgrade AIO-DX
    PASS: force the enumeration issue via database and observe that
          existing PV is updated instead of duplicated after agent
          reports back to conductor

    Closes-bug: 1999679

    Change-Id: I43ae44f088c84b45a7a23c46d1ffca4568673e39
    Signed-off-by: Heitor Matsui <email address hidden>
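
The matching change the commit describes can be sketched as follows; this is a minimal illustration, not the actual sysinv conductor code, and the dict keys are hypothetical stand-ins for the real PV record fields:

    # Hypothetical sketch of the fix described above, not the actual
    # sysinv code; "device_path"/"device_node" keys are illustrative.
    def find_existing_pv(db_pvs, reported):
        """Return the stored PV matching the agent-reported one, or None."""
        for pv in db_pvs:
            # Old behavior (condition [1]): compare device_node, which a
            # newer systemd/udev can derive differently for the same disk.
            # New behavior: compare the persistent by-path device_path.
            if pv["device_path"] == reported["device_path"]:
                return pv
        return None

With a match found, the conductor updates the existing record; only a genuinely new PV is created, which prevents the duplicates.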

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.8.0 stx.config