controller-0 degraded after system unlock

Bug #1922256 reported by Marcus Secato
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Pedro Henrique Linhares
Milestone: (none listed)

Bug Description

Brief Description
-----------------
After the Ansible playbook completes on a DX-Plus system, controller-0 unlocks but remains stuck in the degraded state.

Severity
--------
Critical

Steps to Reproduce
------------------
Install a DX-Plus system.

Expected Behavior
------------------
After controller-0 is unlocked, it should become available.

Actual Behavior
----------------
After controller-0 is unlocked, it remains stuck in the degraded state.

Reproducibility
---------------
First time this issue has been seen.

System Configuration
--------------------
AIO+WORKER Multi-node system

Branch/Pull Time/Commit
-----------------------
2021-03-19_00-00-09

Last Pass
---------
2021-03-11_00-00-07

Test Activity
-------------
Sanity

Marcus Secato (mviniciu) wrote :

Traceback (most recent call last):
  File "/usr/bin/manage-partitions", line 895, in <module>
    main(sys.argv)
  File "/usr/bin/manage-partitions", line 888, in main
    CONF.action.mode, CONF.action.pfile)
  File "/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 328, in inner
    return f(*args, **kwargs)
  File "/usr/bin/manage-partitions", line 870, in run
    action(data, mode, pfile)
  File "/usr/bin/manage-partitions", line 496, in create_partitions
    free_spaces = _get_free_space(device_node=disk_device_path)
  File "/usr/lib64/python2.7/site-packages/sysinv/common/utils.py", line 1400, in wrapper
    fcntl.flock(f, fcntl.LOCK_SH | fcntl.LOCK_NB)
IOError: [Errno 11] Resource temporarily unavailable
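
The traceback ends in a non-blocking shared flock failing with errno 11 (EAGAIN). As a minimal sketch of how that error can arise, assuming another open file description already holds an exclusive lock on the same file (the report does not identify the actual holder), the hypothetical helper `shared_lock_conflicts` below reproduces it. It is written for Python 3 on Linux, where the failure surfaces as OSError rather than the IOError seen under Python 2.7:

```python
import errno
import fcntl
import tempfile


def shared_lock_conflicts(path):
    """Return the errno raised when a shared non-blocking flock on `path`
    collides with an exclusive holder, or None if the lock was granted.

    Hypothetical helper for illustration; not part of sysinv.
    """
    # Two separate open() calls create two open file descriptions,
    # so their flock locks conflict even inside one process.
    with open(path, "w") as holder, open(path, "r") as contender:
        fcntl.flock(holder, fcntl.LOCK_EX | fcntl.LOCK_NB)
        try:
            # Same flags as the failing call in the traceback.
            fcntl.flock(contender, fcntl.LOCK_SH | fcntl.LOCK_NB)
        except OSError as exc:
            return exc.errno
        finally:
            fcntl.flock(holder, fcntl.LOCK_UN)
        return None


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as tmp:
        code = shared_lock_conflicts(tmp.name)
        print("conflict is EAGAIN:", code == errno.EAGAIN)
```

Because LOCK_NB makes the call fail immediately instead of waiting, any brief contention on the device node turns into an unhandled exception unless the caller catches it.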

OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/784482
Committed: https://opendev.org/starlingx/config/commit/cbb9121a289603ec003dec098b8fa5918ca98300
Submitter: "Zuul (22348)"
Branch: master

commit cbb9121a289603ec003dec098b8fa5918ca98300
Author: Marcus Secato <email address hidden>
Date: Thu Apr 1 16:30:13 2021 -0400

    Adjust lock acquiring logic

    When locking the file descriptor skip_udev_partition_probe was not
    handling errors thrown by fcntl.flock which was leading controller-0
    to degraded state after unlock. This change aims to strengthen that
    logic by handling the error properly, retrying the lock operation and
    improving logs.

    Closes-Bug: 1922256

    Signed-off-by: Marcus Secato <email address hidden>
    Change-Id: I000367668744a4e92e20ff9d3f1f8cd717883a46

Changed in starlingx:
status: New → Fix Released
Marcus Secato (mviniciu)
Changed in starlingx:
assignee: nobody → Marcus Secato (mviniciu)
Ghada Khalil (gkhalil)
tags: added: stx.config
Ghada Khalil (gkhalil) wrote :

Screening: the issue was reported once and appears to be intermittent, so it is marked for the next release. Submission to stx master is sufficient; no need to cherry-pick to the r/stx.5.0 release branch.

Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.6.0
Ghada Khalil (gkhalil) wrote :

Re-opening: the commit is being reverted because it introduced other issues: https://review.opendev.org/c/starlingx/config/+/786959

Changed in starlingx:
status: Fix Released → Triaged
Changed in starlingx:
assignee: Marcus Secato (mviniciu) → Pedro Henrique Linhares (linharesp)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/config/+/791302

Changed in starlingx:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/791302
Committed: https://opendev.org/starlingx/config/commit/703b9dc6f97628271950f9b7352fdb8d4df8f74d
Submitter: "Zuul (22348)"
Branch: master

commit 703b9dc6f97628271950f9b7352fdb8d4df8f74d
Author: Pedro Henrique Linhares <email address hidden>
Date: Wed May 12 16:51:19 2021 -0300

    Refactor and expose logic to acquire a flock with retries

    When locking the file descriptor skip_udev_partition_probe was not
    handling errors thrown by fcntl.flock which was leading controller-0
    to degraded state after unlock. This change aims to strengthen that
    logic by handling the error properly, retrying the lock operation and
    improving logs.

    Re-implementation of commit cbb9121a289603ec003dec098b8fa5918ca98300.
    The original commit inadvertently replaced a shared lock with an
    exclusive lock on the decorator skip_udev_partition_probe which caused
    fd locking issues.

    This commit exposes utility functions to acquire shared or exclusive
    non-blocking locks of file descriptors.

    Tested on Standard (2 + 4) and AIO-Simplex configurations. Ran sanity
    load on both.

    Closes-Bug: 1922256
    Change-Id: Ifcddab027df955152f420fd7451f42167694a31a
    Signed-off-by: Pedro Henrique Linhares <email address hidden>
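
The approach the commit message describes (take a shared or exclusive non-blocking flock and retry on contention) can be sketched roughly as follows. This illustrates the idea and is not the code that was merged: the function name `acquire_flock`, its parameters, and the retry count and delay are all hypothetical.

```python
import errno
import fcntl
import tempfile
import time


def acquire_flock(fd, exclusive=False, max_retries=5, delay=0.1):
    """Try to take a non-blocking flock on `fd`, retrying while it is busy.

    Hypothetical helper: returns True once the lock is held, False if
    every attempt found the lock held by someone else.
    """
    # Shared by default; the caller must opt in to an exclusive lock.
    mode = (fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH) | fcntl.LOCK_NB
    for attempt in range(1, max_retries + 1):
        try:
            fcntl.flock(fd, mode)
            return True
        except OSError as exc:
            # Only contention is retried; anything else is a real error.
            if exc.errno not in (errno.EAGAIN, errno.EWOULDBLOCK):
                raise
            if attempt < max_retries:
                time.sleep(delay)
    return False


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as tmp, open(tmp.name) as reader:
        print("shared lock acquired:", acquire_flock(reader))
```

Keeping the lock type an explicit argument makes the shared versus exclusive choice visible at each call site, which is the distinction the reverted commit got wrong.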

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/793460

OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/793696

OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/794611

OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/794906

OpenStack Infra (hudson-openstack) wrote : Change abandoned on config (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/794611

OpenStack Infra (hudson-openstack) wrote : Fix merged to config (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/config/+/794906
Committed: https://opendev.org/starlingx/config/commit/75758b37a5a23c8811355b67e2a430a1713cd85b
Submitter: "Zuul (22348)"
Branch: f/centos8

commit 9e420d9513e5fafb1df4d29567bc299a9e04d58d
Author: Bin Qian <email address hidden>
Date: Mon May 31 14:45:52 2021 -0400

    Add more logging to run docker login

    Add error log for running docker login. The new log could
    help identify docker login failure.

    Closes-Bug: 1930310
    Change-Id: I8a709fb6665de8301fbe3022563499a92b2a0211
    Signed-off-by: Bin Qian <email address hidden>

commit 31c77439d2cea590dfcca13cfa646522665f8686
Author: albailey <email address hidden>
Date: Fri May 28 13:42:42 2021 -0500

    Fix controller-0 downgrade failing to kill ceph

    kill_ceph_storage_monitor tried to manipulate a pmon
    file that does not exist in an AIO-DX environment.

    We no longer invoke kill_ceph_storage_monitor in an
    AIO SX or DX env.

    This allows: "system host-downgrade controller-0"
    to proceed in an AIO-DX environment where that second
    controller (controller-0) was upgraded.

    Partial-Bug: 1929884
    Signed-off-by: albailey <email address hidden>
    Change-Id: I633853f75317736084feae96b5b849c601204c13

commit 0dc99eee608336fe01b58821ea404286371f1408
Author: albailey <email address hidden>
Date: Fri May 28 11:05:43 2021 -0500

    Fix file permissions failure during duplex upgrade abort

    When issuing a downgrade for controller-0 in a duplex upgrade
    abort and rollback scenario, the downgrade command was failing
    because the sysinv API does not have root permissions to set
    a file flag.
    The fix is to use RPC so the conductor can create the flag
    and allow the downgrade for controller-0 to get further.

    Partial-Bug: 1929884
    Signed-off-by: albailey <email address hidden>
    Change-Id: I913bcad73309fe887a12cbb016a518da93327947

commit 7ef3724dad173754e40b45538b1cc726a458cc1c
Author: Chen, Haochuan Z <email address hidden>
Date: Tue May 25 16:16:29 2021 +0800

    Fix bug rook-ceph provision with multi osd on one host

    Test case:
    1, deploy simplex system
    2, apply rook-ceph with below override value
    value.yaml
    cluster:
      storage:
        nodes:
        - name: controller-0
          devices:
          - name: sdb
          - name: sdc
    3, reboot

    Without this fix, only osd pod could launch successfully after boot
    as vg start with ceph could not correctly add in sysinv-database

    Closes-bug: 1929511

    Change-Id: Ia5be599cd168d13d2aab7b5e5890376c3c8a0019
    Signed-off-by: Chen, Haochuan Z <email address hidden>

commit 23505ba77d76114cf8a0bf833f9a5bcd05bc1dd1
Author: Angie Wang <email address hidden>
Date: Tue May 25 18:49:21 2021 -0400

    Fix issue in partition data migration script

    The created partition dictionary partition_map is not
    an ordered dict so we need to sort it by its key -
    device node when iterating it to adjust the device
    nodes/paths for user created extra partitions to ensure
    the number of device node...

tags: added: in-f-centos8
OpenStack Infra (hudson-openstack) wrote : Change abandoned on config (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/793696

OpenStack Infra (hudson-openstack) wrote :

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/793460
