After Controller lock and unlock, Controller goes to failed state

Bug #1999679 reported by Heitor Matsui
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Heitor Matsui

Bug Description

Brief Description
-----------------
After a host lock and unlock on controller-0, the host goes to a failed state.

Severity
--------
Critical: unable to use subclouds

Steps to Reproduce
------------------
1. Install a DC lab with 22.06 and the upgrade patches
2. Follow the DC upgrade steps and upgrade the central cloud and subcloud
3. Lock and unlock controller-0 and verify the state of the host

Expected Behavior
------------------
No failure; the host returns to the unlocked/enabled/available state after the unlock.

Actual Behavior
----------------
 system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname     | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1  | controller-0 | controller  | unlocked       | disabled    | failed       |
| 2  | controller-1 | controller  | unlocked       | enabled     | available    |
+----+--------------+-------------+----------------+-------------+--------------+

Reproducibility
---------------
Intermittent

System Configuration
--------------------
DC

Branch/Pull Time/Commit
-----------------------
2022-11-30_22-00-06

Last Pass
---------
N/A

Timestamp/Logs
--------------
puppet.log:
Error: 2022-12-03 03:45:22 +0000 Evaluation Error: The title '/dev/disk/by-path/pci-0000:00:1f.2-ata-5.0-part4' has already been used in this resource expression (file: /usr/share/puppet/modules/platform/manifests/lvm.pp, line: 57, column: 3) on node controller-0
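
The duplicated PV records explain this failure: when the manifest is generated, two records that share the same persistent by-path name produce two resources with the same title, which Puppet rejects. A minimal illustration, with hypothetical device nodes:

    # Hypothetical illustration: two PV records that point at the same
    # persistent by-path name but different kernel device nodes would
    # both render into lvm.pp under the same resource title.
    pvs = [
        {"device_node": "/dev/sda4",  # hypothetical pre-upgrade node
         "device_path": "/dev/disk/by-path/pci-0000:00:1f.2-ata-5.0-part4"},
        {"device_node": "/dev/sdc4",  # hypothetical post-upgrade node
         "device_path": "/dev/disk/by-path/pci-0000:00:1f.2-ata-5.0-part4"},
    ]
    titles = [pv["device_path"] for pv in pvs]
    assert len(titles) != len(set(titles))  # duplicate title -> Puppet error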

Test Activity
-------------
Regression test

Workaround
----------
Manually delete the unused duplicate PV records from the database, then lock/unlock the host.
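
A sketch of what that cleanup might look like, assuming the sysinv PostgreSQL database and an i_pv table; the database, table, and column names here are assumptions, so inspect the actual schema and records before deleting anything:

    # Hypothetical sketch only: database, table and column names are
    # assumptions and must be verified against the running system.
    import psycopg2

    conn = psycopg2.connect(dbname="sysinv", user="postgres")
    with conn, conn.cursor() as cur:
        # List the PV records so the stale duplicate (same device path,
        # unused device node) can be identified by its id.
        cur.execute("SELECT id, lvm_pv_name, disk_or_part_device_path FROM i_pv")
        for row in cur.fetchall():
            print(row)
        # After confirming which record is the unused duplicate:
        # cur.execute("DELETE FROM i_pv WHERE id = %s", (stale_pv_id,))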

Changed in starlingx:
status: New → In Progress
Changed in starlingx:
assignee: nobody → Heitor Matsui (heitormatsui)
OpenStack Infra (hudson-openstack) wrote: Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/867525
Committed: https://opendev.org/starlingx/config/commit/ca6bc490b7a7df646827e80395e4a5f0b71d6301
Submitter: "Zuul (22348)"
Branch: master

commit ca6bc490b7a7df646827e80395e4a5f0b71d6301
Author: Heitor Matsui <email address hidden>
Date: Tue Dec 13 12:34:39 2022 -0300

    Use device_path to determine if PV is found

    After upgrading a host we observed a rare disk enumeration
    issue that would later duplicate PVs, since the host reports
    a different device_node and thus does not match condition [1].

    This occurs when the same persistent device name points to
    different kernel-derived device nodes on CentOS and Debian.
    This scenario was not previously handled by the conductor
    logic and stems from the much later version of systemd/udev
    used in Debian compared to CentOS.

    This commit adds logic to fetch the device_path earlier and
    use it to determine whether the PV already exists on the
    upgraded system, so the PV is updated instead of created again.

    [1] https://opendev.org/starlingx/config/src/commit/748afd7f5b7d3fc5e958f7173ff1a19c946c73b4/sysinv/sysinv/sysinv/sysinv/conductor/manager.py#L5052

    Test Plan
    PASS: fresh install/bootstrap/unlock
    PASS: host lock/unlock
    PASS: upgrade AIO-DX
    PASS: force the enumeration issue via database and observe that
          existing PV is updated instead of duplicated after agent
          reports back to conductor

    Closes-bug: 1999679

    Change-Id: I43ae44f088c84b45a7a23c46d1ffca4568673e39
    Signed-off-by: Heitor Matsui <email address hidden>
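
The matching change the commit describes can be sketched as follows; this is a minimal illustration, not the actual sysinv conductor code, and the dict keys are hypothetical stand-ins for the real PV record fields:

    # Hypothetical sketch of the fix described above, not the actual
    # sysinv code; "device_path"/"device_node" keys are illustrative.
    def find_existing_pv(db_pvs, reported):
        """Return the stored PV matching the agent-reported one, or None."""
        for pv in db_pvs:
            # Old behavior (condition [1]): compare device_node, which a
            # newer systemd/udev can derive differently for the same disk.
            # New behavior: compare the persistent by-path device_path.
            if pv["device_path"] == reported["device_path"]:
                return pv
        return None

With a match found, the conductor updates the existing record; only a genuinely new PV is created, which prevents the duplicates.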

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.8.0 stx.config