Unlock after force lock enabled the worker according to maintenance but hypervisor remained down

Bug #1824881 reported by Wendy Mitchell
Affects: StarlingX
Status: Invalid
Importance: Medium
Assigned to: Erich Cordoba

Bug Description

Brief Description
-----------------
The hypervisor remained down after a force lock and unlock (although the mtcAgent state shows enabled)

Severity
--------
Major

Steps to Reproduce
------------------
1. The test force locks the worker node (compute-0):
| task | Force Locking
| updated_at | 2019-04-14T07:38:44.886194+00:00

2. Unlock the same node:
[2019-04-14 07:44:20,518] 'system host-unlock compute-0'
3. Confirmed the host is enabled (after intest).

The task updated:

| task | Rebooting
| updated_at | 2019-04-14T07:44:25.981913+00:00

| task | Booting
| updated_at | 2019-04-14T07:44:30.790748+00:00

| task | Testing
| updated_at | 2019-04-14T07:47:12.731347+00:00

| task | Enabling
| updated_at | 2019-04-14T07:47:57.846186+00:00

[2019-04-14 07:48:16,933] 'nova hypervisor-list' indicates that compute-0 is still unexpectedly down
+--------------------------------------+---------------------+-------+---------+
| ID | Hypervisor hostname | State | Status |
+--------------------------------------+---------------------+-------+---------+
| 925b8617-8f67-4904-b1a3-1242b138f1d8 | compute-2 | up | enabled |
| 88ca313d-980e-47a5-ac83-2e49e3d46bd5 | compute-0 | down | enabled |
| b321e9cc-f9a9-4429-89d7-a88a4c947414 | compute-1 | up | enabled |
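
Putting the reproduction steps above together, a minimal sketch of the equivalent CLI sequence, run from the active controller with admin credentials loaded (the force-lock flag spelling is from memory and worth confirming with 'system help host-lock' on your build):

# force lock the worker, then unlock it and watch the states
system host-lock --force compute-0    # assumed flag spelling; '-f' may be the short form
system host-list                      # wait until compute-0 is locked/disabled
system host-unlock compute-0
system host-list                      # wait for unlocked/enabled/available
nova hypervisor-list                  # expected up/enabled; in this run it stayed down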

mtcAgent.log reports that the worker node compute-0 is enabled
2019-04-14T07:39:38.877 [107724.00536] controller-0 mtcAgent inv mtcInvApi.cpp ( 437) mtcInvApi_force_task : Info : compute-0 task clear (seq:209) (was:Force Lock Reset in Progress)
...
2019-04-14T07:44:25.643 fmAPI.cpp(471): Enqueue raise alarm request: UUID (2a196c37-efa9-40a4-ac3c-752ea02e3e93) alarm id (200.021) instant id (host=compute-0.command=unlock)
...
2019-04-14T07:47:12.586 [107724.00572] controller-0 mtcAgent inv mtcInvApi.cpp ( 334) mtcInvApi_update_task : Info : compute-0 Task: Testing (seq:222)
...
2019-04-14T07:47:57.739 [107724.00574] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (1193) enable_handler : Info : compute-0 got GOENABLED
2019-04-14T07:47:57.739 [107724.00575] controller-0 mtcAgent inv mtcInvApi.cpp ( 334) mtcInvApi_update_task : Info : compute-0 Task: Initializing (seq:224)
2019-04-14T07:47:57.744 [107724.00576] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (1234) enable_handler : Info : compute-0 Starting Host Services
2019-04-14T07:47:57.744 [107724.00577] controller-0 mtcAgent hbs nodeClass.cpp (7282) launch_host_services_cmd: Info : compute-0 start worker host services launch
2019-04-14T07:47:57.744 [107724.00578] controller-0 mtcAgent inv mtcInvApi.cpp ( 334) mtcInvApi_update_task : Info : compute-0 Task: Enabling (seq:225)
2019-04-14T07:47:57.744 [107724.00579] controller-0 mtcAgent msg mtcCtrlMsg.cpp (1140) send_hwmon_command : Info : compute-0 sending 'start' to hwmond service
2019-04-14T07:47:57.812 [107724.00580] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (2631) host_services_handler : Info : compute-0 start worker host services completed
2019-04-14T07:47:59.476 [107724.00581] controller-0 mtcAgent |-| nodeClass.cpp (4772) declare_service_ready : Info : compute-0 got hbsClient ready event
2019-04-14T07:47:59.476 [107724.00582] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (1362) enable_handler : Info : compute-0 Starting 11 sec Heartbeat Soak (with ready event)
2019-04-14T07:47:59.476 [107724.00583] controller-0 mtcAgent msg mtcCtrlMsg.cpp ( 784) send_hbs_command : Info : compute-0 sending 'start' to controller-0 heartbeat service
2019-04-14T07:47:59.476 [107724.00584] controller-0 mtcAgent msg mtcCtrlMsg.cpp ( 784) send_hbs_command : Info : compute-0 sending 'start' to controller-1 heartbeat service
2019-04-14T07:48:10.476 [107724.00585] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (1378) enable_handler : Info : compute-0 heartbeating
2019-04-14T07:48:10.481 [107724.00586] controller-0 mtcAgent inv mtcInvApi.cpp ( 437) mtcInvApi_force_task : Info : compute-0 task clear (seq:226) (was:Enabling)
2019-04-14T07:48:10.481 fmAPI.cpp(471): Enqueue raise alarm request: UUID (45172974-8359-4cbf-b7ed-9e82cdb0a3bb) alarm id (200.022) instant id (host=compute-0.state=enabled)
2019-04-14T07:48:10.481 [107724.00587] controller-0 mtcAgent hbs nodeClass.cpp (5785) allStateChange : Info : compute-0 unlocked-enabled-available (seq:227)
2019-04-14T07:48:10.490 fmAlarmUtils.cpp(524): Sending FM raise alarm request: alarm_id (200.022), entity_id (host=compute-0.state=enabled)
2019-04-14T07:48:10.531 fmAlarmUtils.cpp(558): FM Response for raise alarm: (0), alarm_id (200.022), entity_id (host=compute-0.state=enabled)
2019-04-14T07:48:10.648 [107724.00588] controller-0 mtcAgent alm mtcAlarm.cpp ( 396) mtcAlarm_clear : Info : compute-0 clearing 'Configuration' alarm (200.011)
2019-04-14T07:48:10.648 fmAPI.cpp(492): Enqueue clear alarm request: alarm id (200.011), instant id (host=compute-0)
2019-04-14T07:48:10.648 [107724.00589] controller-0 mtcAgent vim mtcVimApi.cpp ( 258) mtcVimApi_state_change : Info : compute-0 sending 'host' state change to vim (enabled)
2019-04-14T07:48:10.648 [107724.00590] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (1529) enable_handler : Info : compute-0 is ENABLED
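
To pull just this enable sequence out of the maintenance logs, a grep sketch along these lines can help (assuming the default /var/log/mtcAgent.log location on the active controller):

# extract the compute-0 enable path from the maintenance agent log (path assumed)
grep -E 'compute-0 (Task:|got GOENABLED|heartbeating|is ENABLED|unlocked-enabled-available)' /var/log/mtcAgent.log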

nfv-vim.log reports compute-0 exiting state (disabled) at 2019-04-14T07:48:10.659 (but the hypervisor remained down)
2019-04-14T07:26:11.068 controller-0 VIM_Thread[1913] DEBUG _vim_nfvi_events.py.63 Host action, host_uuid=4fe67696-fbde-49c5-88b1-2872bb1bda55, host_name=compute-0, do_action=lock-force.

...
2019-04-14T07:48:10.654 controller-0 VIM_Thread[96769] DEBUG _vim_nfvi_events.py.98 Host state-change, nfvi_host_uuid=4fe67696-fbde-49c5-88b1-2872bb1bda55, nfvi_host_name=compute-0, nfvi_host_admin_state=unlocked, nfvi_host_oper_state=enabled, nfvi_host_avail_status=available.
2019-04-14T07:48:10.657 controller-0 VIM_Thread[96769] DEBUG _host.py.680 Host State-Change detected: nfvi_admin_state=unlocked host_admin_state=unlocked, nfvi_oper_state=enabled host_oper_state=disabled, nfvi_avail_state=available host_avail_status=intest, locking=False unlocking=False fsm current_state=disabled for compute-0.
2019-04-14T07:48:10.659 controller-0 VIM_Thread[96769] INFO _host_state_disabled.py.41 Exiting state (disabled) for compute-0.
2019-04-14T07:48:10.659 controller-0 VIM_Thread[96769] INFO _host_state_enabling.py.27 Entering state (enabling) for compute-0.
2019-04-14T07:48:10.661 controller-0 VIM_Thread[96769] INFO _host.py.839 Host compute-0 FSM State-Change: prev_state=disabled, state=enabling, event=enable.
2019-04-14T07:48:10.662 controller-0 VIM_Thread[96769] INFO _host_director.py.475 Host compute-0 state change notification.
2019-04-14T07:48:10.662 controller-0 VIM_Thread[96769] DEBUG nfvi_infrastructure_api.py.2266 Host rest-api patch path: /nfvi-plugins/v1/hosts/4fe67696-fbde-49c5-88b1-2872bb1bda55.
2019-04-14T07:48:10.844 controller-0 VIM_Infrastructure-Worker-0_Thread[97099] INFO kubernetes_client.py.103 Removing services:NoExecute taint from node compute-0
2019-04-14T07:48:22.263 controller-0 VIM_Thread[96769] INFO _vim_nfvi_audits.py.150 Ignoring audit reply for host compute-0
2019-04-14T07:48:26.411 controller-0 VIM_Thread[96769] INFO _hypervisor.py.133 Hypervisor compute-0 state change was locked-disabled now unlocked-disabled
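
Since maintenance and the VIM both report the host enabled while nova does not, it helps to compare the two views at the same timestamps; a sketch (nfv-vim log path assumed to be /var/log/nfv-vim.log, OpenStack admin credentials loaded):

# VIM's view of the hypervisor state transitions
grep '_hypervisor.py' /var/log/nfv-vim.log | grep compute-0
# nova's view: the State column is driven by the nova-compute service heartbeat
nova hypervisor-list
openstack compute service list | grep nova-compute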

Expected Behavior
------------------
After unlocking the worker node (following a force lock), the hypervisor is expected to become enabled (as maintenance reports it to be).

Actual Behavior
----------------
nfv-vim.log reports compute-0 exiting state (disabled) at 2019-04-14T07:48:10.659 (but the hypervisor remained down)

Reproducibility
---------------
yes

System Configuration
--------------------
2+3 (two controllers, three worker nodes)
Lab: WP_3-7
Testcase: nova/test_force_lock_with_vms.py::test_force_lock_with_non_mig_vms

Branch/Pull Time/Commit
-----------------------
20190410T013000Z

Timestamp/Logs
--------------
see inline

Revision history for this message
Ghada Khalil (gkhalil) wrote :

The system in question has mismatched MTUs on the two controllers:
controller-0 has an MTU of 9000 and controller-1 has an MTU of 1500. This can cause issues.

Unsure if this is contributing to this issue or not. Please fix the lab configuration issue and re-test.
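
For reference, a sketch of how the MTUs can be compared on the two controllers (run on each controller, or over ssh):

# kernel view: interface name and mtu value on this controller
ip -o link show | awk '{print $2, $4, $5}'
# sysinv view should agree (output columns vary by release)
system host-if-list controller-0
system host-if-list controller-1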

tags: added: stx.containers
Changed in starlingx:
status: New → Incomplete
assignee: nobody → Wendy Mitchell (wmitchellwr)
Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

Retested on a newer load, with matching MTUs confirmed on both controllers, by force locking and unlocking compute-0.
The hypervisor state and status now revert to up/enabled.

2019-04-22T19:01:27.450 [37954.00032] compute-0 mtcClient --- mtcNodeComp.cpp (1749) daemon_sigchld_hdlr : Info : PASSED: /etc/goenabled.d/worker-goenabled.sh (0.005 secs)
2019-04-22T19:01:27.454 [37954.00033] compute-0 mtcClient --- mtcNodeComp.cpp (1749) daemon_sigchld_hdlr : Info : PASSED: /etc/goenabled.d/config_goenabled_check.sh (0.010 secs)
2019-04-22T19:01:27.454 [37954.00034] compute-0 mtcClient --- mtcNodeComp.cpp (1749) daemon_sigchld_hdlr : Info : PASSED: /etc/goenabled.d/mlx4_core_goenabled.sh (0.010 secs)
2019-04-22T19:01:27.457 [37954.00035] compute-0 mtcClient --- mtcNodeComp.cpp (1749) daemon_sigchld_hdlr : Info : PASSED: /etc/goenabled.d/patch_check_goenabled.sh (0.012 secs)
2019-04-22T19:01:27.462 [37954.00036] compute-0 mtcClient --- mtcNodeComp.cpp (1749) daemon_sigchld_hdlr : Info : PASSED: /etc/goenabled.d/sysinv_goenabled_check.sh (0.017 secs)
2019-04-22T19:01:28.026 [37954.00037] compute-0 mtcClient --- mtcNodeComp.cpp (1749) daemon_sigchld_hdlr : Info : PASSED: /etc/goenabled.d/virt-support-goenabled.sh (0.582 secs)
2019-04-22T19:01:28.393 [37954.00038] compute-0 mtcClient com nodeUtil.cpp (1201) get_node_health : Info : compute-0 is Healthy (0)
2019-04-22T19:01:28.393 [37954.00039] compute-0 mtcClient msg mtcCompMsg.cpp ( 172) mtc_service_command : Info : GoEnabled In-Progress
2019-04-22T19:02:13.483 [37954.00040] compute-0 mtcClient --- mtcNodeComp.cpp (1749) daemon_sigchld_hdlr : Info : PASSED: /etc/goenabled.d/inventory_goenabled_check.sh (46.039 secs)
2019-04-22T19:02:13.533 [37954.00041] compute-0 mtcClient --- mtcNodeComp.cpp ( 848) _manage_goenabled_tests : Info : GoEnabled Testing Complete ; all tests passed
2019-04-22T19:02:13.533 [37954.00042] compute-0 mtcClient msg mtcCompMsg.cpp ( 324) mtc_service_command : Info : main function goEnabled results acknowledged (Mgmnt)
2019-04-22T19:02:13.549 [37954.00043] compute-0 mtcClient msg mtcCompMsg.cpp ( 261) mtc_service_command : Info : start worker host services request posted (Mgmnt)
2019-04-22T19:02:13.549 [37954.00044] compute-0 mtcClient --- mtcNodeComp.cpp (1370) _launch_all_scripts : Info : Sorted Host Services File List: 1
2019-04-22T19:02:13.549 [37954.00045] compute-0 mtcClient --- mtcNodeComp.cpp (1377) _launch_all_scripts : Info : ... /etc/services.d/worker/mtcTest start
2019-04-22T19:02:13.551 [37954.00046] compute-0 mtcClient --- mtcNodeComp.cpp (1749) daemon_sigchld_hdlr : Info : PASSED: /etc/services.d/worker/mtcTest (0.002 secs)
2019-04-22T19:02:13.602 [37954.00047] compute-0 mtcClient --- mtcNodeComp.cpp ( 692) _manage_services_scripts: Info : Host Services Complete ; all passed

mtcAgent.log

2019-04-22T19:02:26.144 [102623.00707] controller-0 mtcAgent hbs nodeClass.cpp (5785) allStateChange : Info : compute-0 unlocked-enabled-available (seq:224)
2019-04-22T19:02:26.180 fmAlarmUtils.cpp(524): Sending FM raise alarm request: alarm_id (200.022), entity_id (host=compute-0.state=en...


Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

BUILD_TYPE="Formal"
BUILD_ID="20190421T233001Z"

Changed in starlingx:
status: Incomplete → Confirmed
Revision history for this message
Frank Miller (sensfan22) wrote :

Based on Wendy's re-test this is not a bug but was a side effect of a lab MTU size mismatch. Marking state as invalid since there is no bug.

Changed in starlingx:
status: Confirmed → Invalid
Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

The force lock, then unlock operation still sometimes appears to fail to enable the hypervisor.

eg.
Lab: WP_3_7
Load: 20190410T013000Z

Steps:
1. Force lock was issued on the host

@ 2019-04-28 02:27:06,105 compute-0 is
administrative | locked
availability | offline
operational | disabled

[2019-04-28 02:27:17,013] compute-0 is in state online
administrative | locked

@ [2019-04-28 02:28:10,711] the worker node is online,locked,disabled
| 3 | compute-0 | worker | locked | disabled | online

2. Then, attempt to unlock
[2019-04-28 02:28:12,146]
system host-unlock compute-0

@ [2019-04-28 02:31:14,323] system host-list
| 3 | compute-0 | worker | unlocked | disabled | intest

@[2019-04-28 02:32:12,754]
| 3 | compute-0 | worker | unlocked | enabled | available

[2019-04-28 02:32:17,593] kubectl get nodes
[2019-04-28 02:32:17,837]
NAME STATUS ROLES AGE VERSION
compute-0 Ready <none> 6h3m v1.13.5
...

@[2019-04-28 02:32:35,550] nova hypervisor-list
| ac079d74-397f-4d36-be23-c20edcca0ecc | compute-0 | down | enabled |

@ [2019-04-28 03:12:35,026] the compute-0 hypervisor is still down

+--------------------------------------+---------------------+-------+---------+
| ID | Hypervisor hostname | State | Status |
+--------------------------------------+---------------------+-------+---------+
| ac079d74-397f-4d36-be23-c20edcca0ecc | compute-0 | down | enabled |
| 0ef7728d-9285-44f0-8df4-3b7dd5bb80b6 | compute-2 | up | enabled |
| 24775101-1ff0-40dc-b65b-3f870ec7a37c | compute-1 | up | enabled |
+--------------------------------------+---------------------+-------+---------+

see nfv-vim.log (on controller-1)

2019-04-28T02:22:32.556 controller-1 VIM_Thread[196336] INFO _instance_director.py.278 Force-Lock issued, can't migrate instances off of host compute-0.
2019-04-28T02:22:32.691 controller-1 VIM_Thread[196336] INFO _host_director.py.408 Notify other directors that the host compute-0 services are disabled.
2019-04-28T02:22:32.691 controller-1 VIM_Thread[196336] INFO _instance_director.py.1392 Host compute-0 services disabled.
...
2019-04-28T02:22:33.176 controller-1 VIM_Thread[196336] INFO _hypervisor.py.133 Hypervisor compute-0 state change was locked-enabled now locked-disabled
...
2019-04-28T02:22:33.177 controller-1 VIM_Thread[196336] INFO _instance_director.py.1427 Host compute-0 disabled.
...
2019-04-28T02:32:25.314 controller-1 VIM_Thread[196336] INFO _hypervisor.py.133 Hypervisor compute-0 state change was locked-disabled now unlocked-disabled
2019-04-28T02:33:04.939 controller-1 VIM_Thread[196336] INFO _host_task_work.py.84...
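
Given the ~40 minute window above, a simple poll like the following sketch (interval is arbitrary) makes it easier to capture exactly when, or whether, the hypervisor recovers after the unlock:

# poll the hypervisor state and the nova-compute service heartbeat once a minute
while true; do
    date
    nova hypervisor-list | grep compute-0
    openstack compute service list | grep compute-0
    sleep 60
done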


Changed in starlingx:
status: Invalid → Confirmed
tags: added: stx.retestneeded
Revision history for this message
Frank Miller (sensfan22) wrote :

Based on Wendy's re-test, this issue occurred at least once. From her timeline, the compute-0 nova hypervisor remained down 40 minutes after compute-0 went enabled following the unlock.

Marking as stx.2.0 gating since the nova hypervisor needs to recover after any unlock of a compute.

Changed in starlingx:
status: Confirmed → Triaged
importance: Undecided → Medium
assignee: Wendy Mitchell (wmitchellwr) → Gerry Kopec (gerry-kopec)
tags: added: stx.2.0
Revision history for this message
Bill Zvonar (billzvonar) wrote :

Assigned to Bruce for re-assignment.

Changed in starlingx:
assignee: Gerry Kopec (gerry-kopec) → Bruce Jones (brucej)
Bruce Jones (brucej)
Changed in starlingx:
assignee: Bruce Jones (brucej) → Abraham Arce (xe1gyq)
Abraham Arce (xe1gyq)
Changed in starlingx:
assignee: Abraham Arce (xe1gyq) → Erich Cordoba (ericho)
Revision history for this message
Erich Cordoba (ericho) wrote :

I was able to reproduce this issue; the hypervisor went down after the lock/unlock. I need to figure out which component is responsible for bringing the hypervisor up before I can debug further. I'll keep updating this issue with the findings.

Revision history for this message
Erich Cordoba (ericho) wrote :

The last attempt to root cause this issue was with this build:

http://mirror.starlingx.cengn.ca/mirror/starlingx/master/centos/20190804T233000Z/outputs/iso/

I'm somewhat blocked by https://bugs.launchpad.net/starlingx/+bug/1838472, but I was able to reproduce the issue twice. This time the hypervisor was `down` but changed to `up` after around 5 minutes.

I checked the "nova-compute" logs; there wasn't any change when the transition from down to up happened. I also reviewed the rabbitmq logs but didn't find any useful information.

I'm still not sure which component causes the transition from down to up.
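
One thing worth ruling out here: nova marks a compute service (and therefore its hypervisor's State column) down when the service's periodic heartbeat is older than service_down_time, which is on the order of a minute by default, so a delayed down-to-up transition usually means the heartbeats only started flowing later. A sketch for watching that in the containerized deployment (the 'openstack' namespace is assumed to match a standard stx-openstack install; pod names will differ):

# the Updated At column is the heartbeat nova uses to decide up/down
openstack compute service list
# check the nova-compute pod on the affected worker for heartbeat/RabbitMQ errors
kubectl -n openstack get pods -o wide | grep nova-compute
kubectl -n openstack logs <nova-compute-pod-on-affected-worker> --timestamps | tail -100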

Revision history for this message
Erich Cordoba (ericho) wrote :
Download full text (3.2 KiB)

I'm unable to reproduce this bug in my virtual environment.

I did a force lock on compute-1, then I checked the output of openstack compute service list:

controller-0:~$ openstack compute service list
+----+------------------+-----------------------------------+----------+----------+-------+----------------------------+
| ID | Binary | Host | Zone | Status | State | Updated At |
+----+------------------+-----------------------------------+----------+----------+-------+----------------------------+
| 7 | nova-consoleauth | nova-consoleauth-7cb854f756-q7sx9 | internal | enabled | up | 2019-08-08T21:41:53.000000 |
| 10 | nova-conductor | nova-conductor-5cdcd7d59b-f4tfn | internal | enabled | up | 2019-08-08T21:42:00.000000 |
| 13 | nova-scheduler | nova-scheduler-767b47d569-mcvth | internal | enabled | up | 2019-08-08T21:41:52.000000 |
| 16 | nova-consoleauth | nova-consoleauth-7cb854f756-6sktp | internal | enabled | up | 2019-08-08T21:41:53.000000 |
| 22 | nova-scheduler | nova-scheduler-767b47d569-58w5j | internal | enabled | up | 2019-08-08T21:41:55.000000 |
| 25 | nova-conductor | nova-conductor-5cdcd7d59b-82xzp | internal | enabled | up | 2019-08-08T21:41:53.000000 |
| 28 | nova-compute | compute-1 | nova | disabled | down | 2019-08-08T21:28:00.000000 |
| 31 | nova-compute | compute-0 | nova | enabled | up | 2019-08-08T21:41:55.000000 |
+----+------------------+-----------------------------------+----------+----------+-------+----------------------------+

Then I did the unlock; five minutes later, after compute-1 booted, the state changed to up:

+----+------------------+-----------------------------------+----------+---------+-------+----------------------------+
| ID | Binary | Host | Zone | Status | State | Updated At |
+----+------------------+-----------------------------------+----------+---------+-------+----------------------------+
| 7 | nova-consoleauth | nova-consoleauth-7cb854f756-q7sx9 | internal | enabled | up | 2019-08-08T21:47:13.000000 |
| 10 | nova-conductor | nova-conductor-5cdcd7d59b-f4tfn | internal | enabled | up | 2019-08-08T21:47:10.000000 |
| 13 | nova-scheduler | nova-scheduler-767b47d569-mcvth | internal | enabled | up | 2019-08-08T21:47:13.000000 |
| 16 | nova-consoleauth | nova-consoleauth-7cb854f756-6sktp | internal | enabled | up | 2019-08-08T21:47:13.000000 |
| 22 | nova-scheduler | nova-scheduler-767b47d569-58w5j | internal | enabled | up | 2019-08-08T21:47:15.000000 |
| 25 | nova-conductor | nova-conductor-5cdcd7d59b-82xzp | internal | enabled | up | 2019-08-08T21:47:13.000000 |
| 28 | nova-compute | compute-1 | nova | enabled | up | 2019-08-08T21:47:16.000000 |
| 31 | nova-compute | compute-0 | nova | enabled | up | 2019-08-08T21:47:15.000000 |
+----+------------------+-----------------------------------+----------+---------+-------+----------------------------+

Is this st...


Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

The original lab I reproduced this on was not available today.
This testcase was run on a different h/w lab (wcp63-66) this time, with a newer build, 2019-08-12_20-59-00.

The tests are passing (2 out of 2 attempts).
The only thing of note is the critical CPU alarms in step f.

After forced lock
a. the worker node becomes
3 | compute-1 | worker | locked | disabled | offline
Thu Aug 15 20:21:01 UTC 2019
hypervisor becomes
| 3 | compute-1 | worker | locked | disabled | offline

the instance reaches states: {'status': 'ERROR'}

b. the worker node then becomes online (briefly) then offline again
| 3 | compute-1 | worker | locked | disabled | online
...
| 3 | compute-1 | worker | unlocked | disabled | offline

c. On unlock, the worker node eventually moves to intest while the hypervisor is still down
| 3 | compute-1 | worker | unlocked | disabled | intest
Thu Aug 15 20:30:59 UTC 2019

$ nova hypervisor-list;date
+--------------------------------------+---------------------+-------+----------+
| ID | Hypervisor hostname | State | Status |
+--------------------------------------+---------------------+-------+----------+
| 31616106-5acb-41ae-bde1-fa7aee9d14c8 | compute-1 | down | disabled |
| 8aebd9e0-6dc7-4f03-b793-7a1a24d43c73 | compute-0 | up | enabled |
+--------------------------------------+---------------------+-------+----------+

Alarms observed as follows:
700.001 Instance tenant1-vm-1 owned by tenant1 has failed on host compute-1 tenant=79ad4133-8937-4154-87f0-23bd72c71412.instance=8a3202e0-9d80-4058-b993-fc9820aef3c6 critical 2019-08-15T20:20:50
200.006 compute-1 is degraded due to the failure of its 'pci-irq-affinity-agent' process. Auto recovery of this major process is in progress. host=compute-1.process=pci-irq-affinity-agent major 2019-08-15T20:31:24
750.004 Application Apply In Progress k8s_application=stx-openstack warning 2019-08-15T20:26:52
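
For completeness, the active alarms around the unlock can be listed with the FM CLI; a minimal sketch:

# all active alarms at the time of the check
fm alarm-list
# narrow down to the alarm IDs seen above
fm alarm-list | grep -E '200\.006|700\.001|750\.004'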

d. The worker node becomes degraded due to 200.006 and hypervisor state is down
| 3 | compute-1 | worker | unlocked | enabled | degraded
Thu Aug 15 20:32:25 UTC 2019
$ nova hypervisor-list;date

+--------------------------------------+---------------------+-------+---------+
| ID | Hypervisor hostname | State | Status |
+--------------------------------------+---------------------+-------+---------+
| 31616106-5acb-41ae-bde1-fa7aee9d14c8 | compute-1 | down | enabled |
| 8aebd9e0-6dc7-4f03-b793-7a1a24d43c73 | compute-0 | up | enabled |
+--------------------------------------+---------------------+-------+---------+

e. The stx-openstack application apply appears to complete and the worker becomes available (the hypervisor is enabled at this point)

| 3 | compute-1 | worker | unlocked | enabled | available |
Thu Aug 15 20:33:15 UTC 2019

$ nova hypervisor-list;date
+--------------------------------------+---------------------+-------+---------+
| ID | Hypervisor hostname | State | Status |
+--...


Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

If you want to mark the LP for retest, it should be retested on the original lab (WP_3-7).

Revision history for this message
Cindy Xie (xxie1) wrote :

Reproduced in the 20190822 master ISO sanity tests.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

As per agreement with the community, moving all unresolved medium priority bugs from stx.2.0 to stx.3.0

tags: added: stx.3.0
removed: stx.2.0
Revision history for this message
Frank Miller (sensfan22) wrote :

Erich, please update this LP with your current status. What are your plans to debug and address this LP?

Revision history for this message
Erich Cordoba (ericho) wrote :

In all the attempts I was unable to reproduce this issue, and a retest by the reporter was required.

I'll try to reproduce it again and will report the results here.

Revision history for this message
Peng Peng (ppeng) wrote :

The issue was not reproduced on Train.
Load: 2019-11-21_20-00-00
Lab: wcp_3-6

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Closing; as per the above, the issue is not reproducible on Train.

Changed in starlingx:
status: Triaged → Invalid
Yang Liu (yliu12)
tags: removed: stx.retestneeded