During load-import fail scenario, load is kept and can not be removed

Bug #1937168 reported by Adriano Oliveira
This bug affects 1 person
Affects     Status         Importance   Assigned to   Milestone
StarlingX   Fix Released   Low          Unassigned

Bug Description

Brief Description
-----------------
system load-import failed during upgrade tests. There is no way to abort or delete the failed import.

Severity
--------
Minor: System/Feature is usable with minor issue

Steps to Reproduce
------------------
Attempt image load:

system load-import bootimage.iso bootimage.sig

Load failed with error:

 Timeout while waiting on RPC response - topic: "sysinv.conductor_manager", RPC method: "start_import_load" info: "<unknown>"".

The system never reports the import as aborted and stays stuck in this state indefinitely. A reboot did not help, and neither delete nor re-import works:

[sysadmin@controller-0 backups(keystone_admin)]$ system load-import bootimage.iso bootimage.sig
Max number of loads (2) reached. Please remove the old or unused load before importing a new one.
[sysadmin@controller-0 backups(keystone_admin)]$

Expected Behavior
-----------------
1. "system load-list" should report the correct state (i.e. not importing).

2. There should be a method for aborting or deleting the failed task.

Should this happen in a production environment, it would halt the upgrade until a solution is provided.

Actual Behavior
---------------
system load-list claims an active import when there is none running.
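
A minimal sketch of why the record ends up stuck, assuming both the delete check and the load-count limit rely only on the stored state of each load. The names `Load`, `can_delete`, `can_import` and `MAX_LOADS` are hypothetical and are not taken from sysinv; the "active" state of the first load is also an assumption. The deletable states and the limit of 2 come from the error messages quoted in this report.

```python
from dataclasses import dataclass

# Limit quoted in the "Max number of loads (2) reached" message.
MAX_LOADS = 2
# States named in "Only a load in an imported or error state can be deleted".
DELETABLE_STATES = {"imported", "error"}


@dataclass
class Load:
    """Hypothetical stand-in for a load record; not the sysinv model."""
    load_id: int
    state: str


def can_delete(load):
    # Delete is refused unless the load reached a terminal state.
    return load.state in DELETABLE_STATES


def can_import(loads):
    # A new import is refused once the load table is full.
    return len(loads) < MAX_LOADS


# Assumed situation after the failed import: the running load plus a
# second record that never left the "importing" state.
loads = [Load(1, "active"), Load(2, "importing")]
print(can_delete(loads[1]))
print(can_import(loads))
```

Under these assumptions the second record can be neither deleted nor replaced, which matches the two refusals shown in this report. Moving the record to "error" when the import fails would make it deletable again.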

Reproducibility
---------------
Seen once

System Configuration
--------------------
AIO-SX

Branch/Pull Time/Commit
-----------------------
n/a

Last Pass
---------
n/a.

Timestamp/Logs
--------------
...

sysinv.log:sysinv 2021-06-30 08:18:57.333 129725 INFO sysinv.api.controllers.v1.load [-] Load import request received.

sysinv.log:sysinv 2021-06-30 08:34:44.496 129725 ERROR wsme.api [-] Server-side error: "Timeout while waiting on RPC response - topic: "sysinv.conductor_manager", RPC method: "start_import_load" info: "<unknown>"". Detail:

sysinv.log: File "/usr/lib64/python2.7/site-packages/sysinv/api/controllers/v1/load.py", line 292, in import_load

sysinv.log: return self._import_load()

sysinv.log: File "/usr/lib64/python2.7/site-packages/sysinv/api/controllers/v1/load.py", line 346, in _import_load

sysinv.log: system_controller_import_active)

sysinv.log: File "/usr/lib64/python2.7/site-packages/sysinv/conductor/rpcapi.py", line 1226, in start_import_load

sysinv.log: import_active=import_active))

sysinv.log:Timeout: Timeout while waiting on RPC response - topic: "sysinv.conductor_manager", RPC method: "start_import_load" info: "<unknown>"

sysinv.log:: Timeout: Timeout while waiting on RPC response - topic: "sysinv.conductor_manager", RPC method: "start_import_load" info: "<unknown>"

sysinv.log:sysinv 2021-06-30 10:48:08.612 129725 ERROR wsme.api [-] Server-side error: "Only a load in an imported or error state can be deleted". Detail:

sysinv.log: _("Only a load in an imported or error state can be deleted"))

sysinv.log:SysinvException: Only a load in an imported or error state can be deleted

...

[sysadmin@controller-0 backups(keystone_admin)]$ date
Thu Jul 1 11:24:29 UTC 2021
[sysadmin@controller-0 backups(keystone_admin)]$ system load-delete 2
Only a load in an imported or error state can be deleted
[sysadmin@controller-0 backups(keystone_admin)]$ uptime
 11:24:43 up 1 day, 34 min, 1 user, load average: 2.54, 2.90, 3.45
[sysadmin@controller-0 backups(keystone_admin)]$

Test Activity
-------------
Feature testing

Workaround
----------
n/a

summary: - During load-import fail scenario load is kept and can not be removed
+ During load-import fail scenario, load is kept and can not be removed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/config/+/801706

Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/801706
Committed: https://opendev.org/starlingx/config/commit/5b93236b60f2ee58b2a8631badb88e4eca0ffb5e
Submitter: "Zuul (22348)"
Branch: master

commit 5b93236b60f2ee58b2a8631badb88e4eca0ffb5e
Author: Adriano Oliveira <email address hidden>
Date: Thu Jul 22 02:47:36 2021 -0400

    Remove load image file in case of import error

    In a scenario in which the AMQP Server lost connection, a TimeoutError
    occurred on the start_import_load RPC call to sysinv conductor.
    This error was not handled so the temporary file was not removed.
    The current staging cleanup routine was extended to handle
    TimeoutError. Also, when writing the file to disk a possible out of
    space error was detected and the file removed.

    Tests were performed by killing sysinv-conductor and generating the
    TimeoutError on the RPC call to conductor.
    Also, no space available was simulated by adding files to the
    /scratch partition.
    Scenario of successful load-import was also tested, as well
    load-delete.

    Closes-Bug: 1937168
    Signed-off-by: Adriano Oliveira <email address hidden>
    Change-Id: Ic23787df9a765a9374b4dd33f9242ee000cbf75c

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/config/+/804771

Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Low
tags: added: stx.6.0 stx.update