Load-import hangs indefinitely in importing state

Bug #1962779 reported by Virginia Martins Perozim
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Virginia Martins Perozim

Bug Description

Brief Description
-----------------
Shortly after a load-import request was received, the sysinv conductor restarted, and the load import hung indefinitely in the importing state.

Severity
--------
Major: upgrade fails

Steps to Reproduce
------------------
1. Modify the dcmanager orchestration code so that it imports the entire bootimage.iso to the simplex subclouds in DC-2. Note: this code change is required for the test because no DC lab with 50 AIO-DX subclouds is available.
2. Perform 50 concurrent load imports via upgrade orchestration.

Expected Behavior
------------------
Load import is successful

Actual Behavior
----------------
In the middle of the load import, the sysinv conductor restarted. After a 30-minute wait, the upgrade thread on the system controller exited with an error. The load remains in the importing state indefinitely.

Reproducibility
---------------
Intermittent

System Configuration
--------------------
IPv6

Branch/Pull Time/Commit
-----------------------
master

Last Pass
---------
Notes from the last test:

The updates for load-import were verified with 50 subclouds: https://review.opendev.org/c/starlingx/config/+/812098

Exception: https://review.opendev.org/c/starlingx/distcloud/+/813269. This update only affects proxied load-import on the SystemController, not the subcloud.

Timestamp/Logs
--------------
sysinv 2021-12-01 17:13:59.211 99321 INFO sysinv.api.controllers.v1.load [-] Load import request received.

sysinv 2021-12-01 17:15:45.560 99016 INFO oslo_service.service [-] Caught SIGTERM, stopping children

sysinv 2021-12-01 17:15:45.566 99016 INFO oslo.service.wsgi [-] Stopping WSGI server.

sysinv 2021-12-01 17:15:45.566 99016 INFO oslo_service.service [-] Waiting on 1 children to exit

sysinv 2021-12-01 17:15:45.567 99321 INFO oslo.service.wsgi [-] Stopping WSGI server.

sysinv 2021-12-01 17:16:02.057 99322 INFO oslo_service.service [-] Parent process has died unexpectedly, exiting

sysinv 2021-12-01 17:16:02.059 99322 INFO oslo.service.wsgi [-] Stopping WSGI server.

sysinv 2021-12-01 17:16:02.059 99322 INFO oslo.service.wsgi [-] Stopping WSGI server.

sysinv 2021-12-01 17:16:02.066 99321 INFO oslo_service.service [-] Parent process has died unexpectedly, exiting

sysinv 2021-12-01 17:16:02.067 99321 INFO oslo.service.wsgi [-] Stopping WSGI server.
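
To make the timing in these logs concrete, the following minimal sketch computes the gap between the load import request and the SIGTERM, which comes to about 106 seconds. The helper names `parse_timestamp` and `seconds_between` are hypothetical, and the field layout (service, date, time, pid, ...) is assumed from the lines quoted above.

```python
from datetime import datetime

# Two lines copied from the log excerpt above.
LOG_LINES = [
    "sysinv 2021-12-01 17:13:59.211 99321 INFO sysinv.api.controllers.v1.load [-] Load import request received.",
    "sysinv 2021-12-01 17:15:45.560 99016 INFO oslo_service.service [-] Caught SIGTERM, stopping children",
]


def parse_timestamp(line: str) -> datetime:
    """Extract the timestamp, assuming the layout 'service date time pid ...'."""
    _service, date, time, _rest = line.split(" ", 3)
    return datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S.%f")


def seconds_between(first: str, second: str) -> float:
    """Return the elapsed seconds from the first log line to the second."""
    return (parse_timestamp(second) - parse_timestamp(first)).total_seconds()


if __name__ == "__main__":
    print(seconds_between(LOG_LINES[0], LOG_LINES[1]))
```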

Test Activity
-------------
Developer Test

Workaround
----------
N/A

Changed in starlingx:
assignee: nobody → Virginia Martins Perozim (vmperozim)
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/config/+/832789

Ghada Khalil (gkhalil)
tags: added: stx.config stx.distcloud
tags: removed: stx.config
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/config/+/869224

OpenStack Infra (hudson-openstack) wrote : Change abandoned on config (master)

Change abandoned by "Virginia Martins Perozim <email address hidden>" on branch: master
Review: https://review.opendev.org/c/starlingx/config/+/832789
Reason: Solution in https://review.opendev.org/c/starlingx/config/+/869224

OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/869224
Committed: https://opendev.org/starlingx/config/commit/9ad267fe448fd2aa402fff9e49f235d189c55068
Submitter: "Zuul (22348)"
Branch: master

commit 9ad267fe448fd2aa402fff9e49f235d189c55068
Author: Heitor Matsui <email address hidden>
Date: Wed Jan 4 11:41:31 2023 -0300

    Unstuck loads hung in importing state

    There are known situations where a load might get stuck in
    a transient state, like "importing", such as if conductor
    restarts during the load import process. In this case, the
    load stays permanently in importing state and there are no
    mechanisms to remove it from this stuck state, currently.

    This commit adds the functionality to set loads in transient
    state to 'error' and unmount them from the temporary directory
    during conductor startup, so that the load can be deleted and
    reimported.

    Test Plan
    PASS: run load import, restart conductor to simulate the failure
          and verify that the load is unmounted and set to error state
          after conductor comes back online, then verify that load can
          be deleted and reimported successfully
    PASS: regression - import load successfully in a "no-error" scenario
    PASS: run both tests with and without load-import "--local" option
    PASS: all of the above with "--os-region-name SystemController"

    Relates-to: https://review.opendev.org/c/starlingx/config/+/832789
    Closes-bug: 1962779

    Signed-off-by: Heitor Matsui <email address hidden>
    Change-Id: I909ebcd9448974d95200c6d5c79e7428b232d0a8

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.8.0