Unlock after force lock enabled the worker according to maintenance but hypervisor remained down

Bug #1824881 reported by Wendy Mitchell
Affects: StarlingX
Status: Invalid
Importance: Medium
Assigned to: Erich Cordoba

Bug Description

Brief Description
-----------------
The hypervisor remained down after a force lock and unlock (although the mtcAgent state shows enabled)

Severity
--------
Major

Steps to Reproduce
------------------
1. The test force locks the worker node (compute-0):
| task | Force Locking
| updated_at | 2019-04-14T07:38:44.886194+00:00

2. Unlock the same node:
[2019-04-14 07:44:20,518] 'system host-unlock compute-0'
3. Confirmed the host is enabled (after intest).

The task updated:

| task | Rebooting
| updated_at | 2019-04-14T07:44:25.981913+00:00

| task | Booting
| updated_at | 2019-04-14T07:44:30.790748+00:00

| task | Testing
| updated_at | 2019-04-14T07:47:12.731347+00:00

| task | Enabling
| updated_at | 2019-04-14T07:47:57.846186+00:00

[2019-04-14 07:48:16,933] 'nova hypervisor-list' indicates that compute-0 is still unexpectedly down
+--------------------------------------+---------------------+-------+---------+
| ID | Hypervisor hostname | State | Status |
+--------------------------------------+---------------------+-------+---------+
| 925b8617-8f67-4904-b1a3-1242b138f1d8 | compute-2 | up | enabled |
| 88ca313d-980e-47a5-ac83-2e49e3d46bd5 | compute-0 | down | enabled |
| b321e9cc-f9a9-4429-89d7-a88a4c947414 | compute-1 | up | enabled |
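
Putting the reproduction steps above together, a minimal sketch of the equivalent CLI sequence, run from the active controller with admin credentials loaded (the force-lock flag spelling is from memory and worth confirming with 'system help host-lock' on your build):

# force lock the worker, then unlock it and watch the states
system host-lock --force compute-0    # assumed flag spelling; '-f' may be the short form
system host-list                      # wait until compute-0 is locked/disabled
system host-unlock compute-0
system host-list                      # wait for unlocked/enabled/available
nova hypervisor-list                  # expected up/enabled; in this run it stayed down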

mtcAgent.log reports that the worker node compute-0 is enabled
2019-04-14T07:39:38.877 [107724.00536] controller-0 mtcAgent inv mtcInvApi.cpp ( 437) mtcInvApi_force_task : Info : compute-0 task clear (seq:209) (was:Force Lock Reset in Progress)
...
2019-04-14T07:44:25.643 fmAPI.cpp(471): Enqueue raise alarm request: UUID (2a196c37-efa9-40a4-ac3c-752ea02e3e93) alarm id (200.021) instant id (host=compute-0.command=unlock)
...
2019-04-14T07:47:12.586 [107724.00572] controller-0 mtcAgent inv mtcInvApi.cpp ( 334) mtcInvApi_update_task : Info : compute-0 Task: Testing (seq:222)
...
2019-04-14T07:47:57.739 [107724.00574] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (1193) enable_handler : Info : compute-0 got GOENABLED
2019-04-14T07:47:57.739 [107724.00575] controller-0 mtcAgent inv mtcInvApi.cpp ( 334) mtcInvApi_update_task : Info : compute-0 Task: Initializing (seq:224)
2019-04-14T07:47:57.744 [107724.00576] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (1234) enable_handler : Info : compute-0 Starting Host Services
2019-04-14T07:47:57.744 [107724.00577] controller-0 mtcAgent hbs nodeClass.cpp (7282) launch_host_services_cmd: Info : compute-0 start worker host services launch
2019-04-14T07:47:57.744 [107724.00578] controller-0 mtcAgent inv mtcInvApi.cpp ( 334) mtcInvApi_update_task : Info : compute-0 Task: Enabling (seq:225)
2019-04-14T07:47:57.744 [107724.00579] controller-0 mtcAgent msg mtcCtrlMsg.cpp (1140) send_hwmon_command : Info : compute-0 sending 'start' to hwmond service
2019-04-14T07:47:57.812 [107724.00580] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (2631) host_services_handler : Info : compute-0 start worker host services completed
2019-04-14T07:47:59.476 [107724.00581] controller-0 mtcAgent |-| nodeClass.cpp (4772) declare_service_ready : Info : compute-0 got hbsClient ready event
2019-04-14T07:47:59.476 [107724.00582] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (1362) enable_handler : Info : compute-0 Starting 11 sec Heartbeat Soak (with ready event)
2019-04-14T07:47:59.476 [107724.00583] controller-0 mtcAgent msg mtcCtrlMsg.cpp ( 784) send_hbs_command : Info : compute-0 sending 'start' to controller-0 heartbeat service
2019-04-14T07:47:59.476 [107724.00584] controller-0 mtcAgent msg mtcCtrlMsg.cpp ( 784) send_hbs_command : Info : compute-0 sending 'start' to controller-1 heartbeat service
2019-04-14T07:48:10.476 [107724.00585] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (1378) enable_handler : Info : compute-0 heartbeating
2019-04-14T07:48:10.481 [107724.00586] controller-0 mtcAgent inv mtcInvApi.cpp ( 437) mtcInvApi_force_task : Info : compute-0 task clear (seq:226) (was:Enabling)
2019-04-14T07:48:10.481 fmAPI.cpp(471): Enqueue raise alarm request: UUID (45172974-8359-4cbf-b7ed-9e82cdb0a3bb) alarm id (200.022) instant id (host=compute-0.state=enabled)
2019-04-14T07:48:10.481 [107724.00587] controller-0 mtcAgent hbs nodeClass.cpp (5785) allStateChange : Info : compute-0 unlocked-enabled-available (seq:227)
2019-04-14T07:48:10.490 fmAlarmUtils.cpp(524): Sending FM raise alarm request: alarm_id (200.022), entity_id (host=compute-0.state=enabled)
2019-04-14T07:48:10.531 fmAlarmUtils.cpp(558): FM Response for raise alarm: (0), alarm_id (200.022), entity_id (host=compute-0.state=enabled)
2019-04-14T07:48:10.648 [107724.00588] controller-0 mtcAgent alm mtcAlarm.cpp ( 396) mtcAlarm_clear : Info : compute-0 clearing 'Configuration' alarm (200.011)
2019-04-14T07:48:10.648 fmAPI.cpp(492): Enqueue clear alarm request: alarm id (200.011), instant id (host=compute-0)
2019-04-14T07:48:10.648 [107724.00589] controller-0 mtcAgent vim mtcVimApi.cpp ( 258) mtcVimApi_state_change : Info : compute-0 sending 'host' state change to vim (enabled)
2019-04-14T07:48:10.648 [107724.00590] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (1529) enable_handler : Info : compute-0 is ENABLED
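
To pull just this enable sequence out of the maintenance logs, a grep sketch along these lines can help (assuming the default /var/log/mtcAgent.log location on the active controller):

# extract the compute-0 enable path from the maintenance agent log (path assumed)
grep -E 'compute-0 (Task:|got GOENABLED|heartbeating|is ENABLED|unlocked-enabled-available)' /var/log/mtcAgent.log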

nfv-vim.log reports compute-0 exiting state (disabled) at 2019-04-14T07:48:10.659 (but the hypervisor remained down)
2019-04-14T07:26:11.068 controller-0 VIM_Thread[1913] DEBUG _vim_nfvi_events.py.63 Host action, host_uuid=4fe67696-fbde-49c5-88b1-2872bb1bda55, host_name=compute-0, do_action=lock-force.

...
2019-04-14T07:48:10.654 controller-0 VIM_Thread[96769] DEBUG _vim_nfvi_events.py.98 Host state-change, nfvi_host_uuid=4fe67696-fbde-49c5-88b1-2872bb1bda55, nfvi_host_name=compute-0, nfvi_host_admin_state=unlocked, nfvi_host_oper_state=enabled, nfvi_host_avail_status=available.
2019-04-14T07:48:10.657 controller-0 VIM_Thread[96769] DEBUG _host.py.680 Host State-Change detected: nfvi_admin_state=unlocked host_admin_state=unlocked, nfvi_oper_state=enabled host_oper_state=disabled, nfvi_avail_state=available host_avail_status=intest, locking=False unlocking=False fsm current_state=disabled for compute-0.
2019-04-14T07:48:10.659 controller-0 VIM_Thread[96769] INFO _host_state_disabled.py.41 Exiting state (disabled) for compute-0.
2019-04-14T07:48:10.659 controller-0 VIM_Thread[96769] INFO _host_state_enabling.py.27 Entering state (enabling) for compute-0.
2019-04-14T07:48:10.661 controller-0 VIM_Thread[96769] INFO _host.py.839 Host compute-0 FSM State-Change: prev_state=disabled, state=enabling, event=enable.
2019-04-14T07:48:10.662 controller-0 VIM_Thread[96769] INFO _host_director.py.475 Host compute-0 state change notification.
2019-04-14T07:48:10.662 controller-0 VIM_Thread[96769] DEBUG nfvi_infrastructure_api.py.2266 Host rest-api patch path: /nfvi-plugins/v1/hosts/4fe67696-fbde-49c5-88b1-2872bb1bda55.
2019-04-14T07:48:10.844 controller-0 VIM_Infrastructure-Worker-0_Thread[97099] INFO kubernetes_client.py.103 Removing services:NoExecute taint from node compute-0
2019-04-14T07:48:22.263 controller-0 VIM_Thread[96769] INFO _vim_nfvi_audits.py.150 Ignoring audit reply for host compute-0
2019-04-14T07:48:26.411 controller-0 VIM_Thread[96769] INFO _hypervisor.py.133 Hypervisor compute-0 state change was locked-disabled now unlocked-disabled
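
Since maintenance and the VIM both report the host enabled while nova does not, it helps to compare the two views at the same timestamps; a sketch (nfv-vim log path assumed to be /var/log/nfv-vim.log, OpenStack admin credentials loaded):

# VIM's view of the hypervisor state transitions
grep '_hypervisor.py' /var/log/nfv-vim.log | grep compute-0
# nova's view: the State column is driven by the nova-compute service heartbeat
nova hypervisor-list
openstack compute service list | grep nova-compute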

Expected Behavior
------------------
After unlocking the worker node (following a force lock), the hypervisor is expected to become enabled (as maintenance reports it to be).

Actual Behavior
----------------
nfv-vim.log reports compute-0 exiting state (disabled) at 2019-04-14T07:48:10.659 (but the hypervisor remained down)

Reproducibility
---------------
yes

System Configuration
--------------------
2+3 (two controllers, three worker nodes)
Lab: WP_3-7
Testcase: nova/test_force_lock_with_vms.py::test_force_lock_with_non_mig_vms

Branch/Pull Time/Commit
-----------------------
20190410T013000Z

Timestamp/Logs
--------------
see inline

Revision history for this message
Ghada Khalil (gkhalil) wrote :

The system in question has mismatched MTUs on the two controllers:
controller-0 has an MTU of 9000 and controller-1 has an MTU of 1500. This can cause issues.

Unsure if this is contributing to this issue or not. Please fix the lab configuration issue and re-test.
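
For reference, a sketch of how the MTUs can be compared on the two controllers (run on each controller, or over ssh):

# kernel view: interface name and mtu value on this controller
ip -o link show | awk '{print $2, $4, $5}'
# sysinv view should agree (output columns vary by release)
system host-if-list controller-0
system host-if-list controller-1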

tags: added: stx.containers
Changed in starlingx:
status: New → Incomplete
assignee: nobody → Wendy Mitchell (wmitchellwr)
Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

Retested on a newer load, with matching MTUs confirmed on both controllers, by force locking and unlocking compute-0.
The hypervisor state and status now revert to up/enabled.

2019-04-22T19:01:27.450 [37954.00032] compute-0 mtcClient --- mtcNodeComp.cpp (1749) daemon_sigchld_hdlr : Info : PASSED: /etc/goenabled.d/worker-goenabled.sh (0.005 secs)
2019-04-22T19:01:27.454 [37954.00033] compute-0 mtcClient --- mtcNodeComp.cpp (1749) daemon_sigchld_hdlr : Info : PASSED: /etc/goenabled.d/config_goenabled_check.sh (0.010 secs)
2019-04-22T19:01:27.454 [37954.00034] compute-0 mtcClient --- mtcNodeComp.cpp (1749) daemon_sigchld_hdlr : Info : PASSED: /etc/goenabled.d/mlx4_core_goenabled.sh (0.010 secs)
2019-04-22T19:01:27.457 [37954.00035] compute-0 mtcClient --- mtcNodeComp.cpp (1749) daemon_sigchld_hdlr : Info : PASSED: /etc/goenabled.d/patch_check_goenabled.sh (0.012 secs)
2019-04-22T19:01:27.462 [37954.00036] compute-0 mtcClient --- mtcNodeComp.cpp (1749) daemon_sigchld_hdlr : Info : PASSED: /etc/goenabled.d/sysinv_goenabled_check.sh (0.017 secs)
2019-04-22T19:01:28.026 [37954.00037] compute-0 mtcClient --- mtcNodeComp.cpp (1749) daemon_sigchld_hdlr : Info : PASSED: /etc/goenabled.d/virt-support-goenabled.sh (0.582 secs)
2019-04-22T19:01:28.393 [37954.00038] compute-0 mtcClient com nodeUtil.cpp (1201) get_node_health : Info : compute-0 is Healthy (0)
2019-04-22T19:01:28.393 [37954.00039] compute-0 mtcClient msg mtcCompMsg.cpp ( 172) mtc_service_command : Info : GoEnabled In-Progress
2019-04-22T19:02:13.483 [37954.00040] compute-0 mtcClient --- mtcNodeComp.cpp (1749) daemon_sigchld_hdlr : Info : PASSED: /etc/goenabled.d/inventory_goenabled_check.sh (46.039 secs)
2019-04-22T19:02:13.533 [37954.00041] compute-0 mtcClient --- mtcNodeComp.cpp ( 848) _manage_goenabled_tests : Info : GoEnabled Testing Complete ; all tests passed
2019-04-22T19:02:13.533 [37954.00042] compute-0 mtcClient msg mtcCompMsg.cpp ( 324) mtc_service_command : Info : main function goEnabled results acknowledged (Mgmnt)
2019-04-22T19:02:13.549 [37954.00043] compute-0 mtcClient msg mtcCompMsg.cpp ( 261) mtc_service_command : Info : start worker host services request posted (Mgmnt)
2019-04-22T19:02:13.549 [37954.00044] compute-0 mtcClient --- mtcNodeComp.cpp (1370) _launch_all_scripts : Info : Sorted Host Services File List: 1
2019-04-22T19:02:13.549 [37954.00045] compute-0 mtcClient --- mtcNodeComp.cpp (1377) _launch_all_scripts : Info : ... /etc/services.d/worker/mtcTest start
2019-04-22T19:02:13.551 [37954.00046] compute-0 mtcClient --- mtcNodeComp.cpp (1749) daemon_sigchld_hdlr : Info : PASSED: /etc/services.d/worker/mtcTest (0.002 secs)
2019-04-22T19:02:13.602 [37954.00047] compute-0 mtcClient --- mtcNodeComp.cpp ( 692) _manage_services_scripts: Info : Host Services Complete ; all passed

mtcAgent.log

2019-04-22T19:02:26.144 [102623.00707] controller-0 mtcAgent hbs nodeClass.cpp (5785) allStateChange : Info : compute-0 unlocked-enabled-available (seq:224)
2019-04-22T19:02:26.180 fmAlarmUtils.cpp(524): Sending FM raise alarm request: alarm_id (200.022), entity_id (host=compute-0.state=en...


Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

BUILD_TYPE="Formal"
BUILD_ID="20190421T233001Z"

Changed in starlingx:
status: Incomplete → Confirmed
Revision history for this message
Frank Miller (sensfan22) wrote :

Based on Wendy's re-test this is not a bug but was a side effect of a lab MTU size mismatch. Marking state as invalid since there is no bug.

Changed in starlingx:
status: Confirmed → Invalid
Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

The force lock, then unlock operation still sometimes appears to fail to enable the hypervisor.

eg.
Lab: WP_3_7
Load: 20190410T013000Z

Steps:
1. Force lock was issued on the host

@ 2019-04-28 02:27:06,105 compute-0 is
administrative | locked
availability | offline
operational | disabled

[2019-04-28 02:27:17,013] compute-0 is in state online
administrative | locked

@ [2019-04-28 02:28:10,711] the worker node is online,locked,disabled
| 3 | compute-0 | worker | locked | disabled | online

2. Then, attempt to unlock
[2019-04-28 02:28:12,146]
system host-unlock compute-0

@ [2019-04-28 02:31:14,323] system host-list
| 3 | compute-0 | worker | unlocked | disabled | intest

@[2019-04-28 02:32:12,754]
| 3 | compute-0 | worker | unlocked | enabled | available

[2019-04-28 02:32:17,593] kubectl get nodes
[2019-04-28 02:32:17,837]
NAME STATUS ROLES AGE VERSION
compute-0 Ready <none> 6h3m v1.13.5
...

@[2019-04-28 02:32:35,550] nova hypervisor-list
| ac079d74-397f-4d36-be23-c20edcca0ecc | compute-0 | down | enabled |

@ [2019-04-28 03:12:35,026] the compute-0 hypervisor is still down

+--------------------------------------+---------------------+-------+---------+
| ID | Hypervisor hostname | State | Status |
+--------------------------------------+---------------------+-------+---------+
| ac079d74-397f-4d36-be23-c20edcca0ecc | compute-0 | down | enabled |
| 0ef7728d-9285-44f0-8df4-3b7dd5bb80b6 | compute-2 | up | enabled |
| 24775101-1ff0-40dc-b65b-3f870ec7a37c | compute-1 | up | enabled |
+--------------------------------------+---------------------+-------+---------+

see nfv-vim.log (on controller-1)

2019-04-28T02:22:32.556 controller-1 VIM_Thread[196336] INFO _instance_director.py.278 Force-Lock issued, can't migrate instances off of host compute-0.
2019-04-28T02:22:32.691 controller-1 VIM_Thread[196336] INFO _host_director.py.408 Notify other directors that the host compute-0 services are disabled.
2019-04-28T02:22:32.691 controller-1 VIM_Thread[196336] INFO _instance_director.py.1392 Host compute-0 services disabled.
...
2019-04-28T02:22:33.176 controller-1 VIM_Thread[196336] INFO _hypervisor.py.133 Hypervisor compute-0 state change was locked-enabled now locked-disabled
...
2019-04-28T02:22:33.177 controller-1 VIM_Thread[196336] INFO _instance_director.py.1427 Host compute-0 disabled.
...
2019-04-28T02:32:25.314 controller-1 VIM_Thread[196336] INFO _hypervisor.py.133 Hypervisor compute-0 state change was locked-disabled now unlocked-disabled
2019-04-28T02:33:04.939 controller-1 VIM_Thread[196336] INFO _host_task_work.py.84...
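
Given the ~40 minute window above, a simple poll like the following sketch (interval is arbitrary) makes it easier to capture exactly when, or whether, the hypervisor recovers after the unlock:

# poll the hypervisor state and the nova-compute service heartbeat once a minute
while true; do
    date
    nova hypervisor-list | grep compute-0
    openstack compute service list | grep compute-0
    sleep 60
done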


Changed in starlingx:
status: Invalid → Confirmed
tags: added: stx.retestneeded
Revision history for this message
Frank Miller (sensfan22) wrote :

Based on Wendy's re-test, this issue occurred at least once. From her timeline, the compute-0 nova hypervisor remained down 40 minutes after compute-0 went enabled following the unlock.

Marking as stx.2.0 gating since the nova hypervisor needs to recover after any unlock of a compute.

Changed in starlingx:
status: Confirmed → Triaged
importance: Undecided → Medium
assignee: Wendy Mitchell (wmitchellwr) → Gerry Kopec (gerry-kopec)
tags: added: stx.2.0
Revision history for this message
Bill Zvonar (billzvonar) wrote :

Assigned to Bruce for re-assignment.

Changed in starlingx:
assignee: Gerry Kopec (gerry-kopec) → Bruce Jones (brucej)
Bruce Jones (brucej)
Changed in starlingx:
assignee: Bruce Jones (brucej) → Abraham Arce (xe1gyq)
Abraham Arce (xe1gyq)
Changed in starlingx:
assignee: Abraham Arce (xe1gyq) → Erich Cordoba (ericho)
Revision history for this message
Erich Cordoba (ericho) wrote :

I was able to reproduce this issue; the hypervisor went down after the lock/unlock. I need to figure out which component is responsible for bringing the hypervisor up before I can debug further. I'll keep updating this issue with the findings.

Revision history for this message
Erich Cordoba (ericho) wrote :

The last attempt to root cause this issue was with this build:

http://mirror.starlingx.cengn.ca/mirror/starlingx/master/centos/20190804T233000Z/outputs/iso/

I'm somewhat blocked by https://bugs.launchpad.net/starlingx/+bug/1838472, but I was able to reproduce the issue twice. This time the hypervisor was `down` but changed to `up` after around 5 minutes.

I checked the "nova-compute" logs; there wasn't any change when the transition from down to up happened. I also reviewed the rabbitmq logs but didn't find any useful information.

I'm still not sure which component causes the transition from down to up.
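
One thing worth ruling out here: nova marks a compute service (and therefore its hypervisor's State column) down when the service's periodic heartbeat is older than service_down_time, which is on the order of a minute by default, so a delayed down-to-up transition usually means the heartbeats only started flowing later. A sketch for watching that in the containerized deployment (the 'openstack' namespace is assumed to match a standard stx-openstack install; pod names will differ):

# the Updated At column is the heartbeat nova uses to decide up/down
openstack compute service list
# check the nova-compute pod on the affected worker for heartbeat/RabbitMQ errors
kubectl -n openstack get pods -o wide | grep nova-compute
kubectl -n openstack logs <nova-compute-pod-on-affected-worker> --timestamps | tail -100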

Revision history for this message
Erich Cordoba (ericho) wrote :
Download full text (3.2 KiB)

I'm unable to reproduce this bug in my virtual environment.

I did a force lock on compute-1, then I checked the output of openstack compute service list:

controller-0:~$ openstack compute service list
+----+------------------+-----------------------------------+----------+----------+-------+----------------------------+
| ID | Binary | Host | Zone | Status | State | Updated At |
+----+------------------+-----------------------------------+----------+----------+-------+----------------------------+
| 7 | nova-consoleauth | nova-consoleauth-7cb854f756-q7sx9 | internal | enabled | up | 2019-08-08T21:41:53.000000 |
| 10 | nova-conductor | nova-conductor-5cdcd7d59b-f4tfn | internal | enabled | up | 2019-08-08T21:42:00.000000 |
| 13 | nova-scheduler | nova-scheduler-767b47d569-mcvth | internal | enabled | up | 2019-08-08T21:41:52.000000 |
| 16 | nova-consoleauth | nova-consoleauth-7cb854f756-6sktp | internal | enabled | up | 2019-08-08T21:41:53.000000 |
| 22 | nova-scheduler | nova-scheduler-767b47d569-58w5j | internal | enabled | up | 2019-08-08T21:41:55.000000 |
| 25 | nova-conductor | nova-conductor-5cdcd7d59b-82xzp | internal | enabled | up | 2019-08-08T21:41:53.000000 |
| 28 | nova-compute | compute-1 | nova | disabled | down | 2019-08-08T21:28:00.000000 |
| 31 | nova-compute | compute-0 | nova | enabled | up | 2019-08-08T21:41:55.000000 |
+----+------------------+-----------------------------------+----------+----------+-------+----------------------------+

Then I did the unlock; five minutes later, after compute-1 booted, the state changed to up:

+----+------------------+-----------------------------------+----------+---------+-------+----------------------------+
| ID | Binary | Host | Zone | Status | State | Updated At |
+----+------------------+-----------------------------------+----------+---------+-------+----------------------------+
| 7 | nova-consoleauth | nova-consoleauth-7cb854f756-q7sx9 | internal | enabled | up | 2019-08-08T21:47:13.000000 |
| 10 | nova-conductor | nova-conductor-5cdcd7d59b-f4tfn | internal | enabled | up | 2019-08-08T21:47:10.000000 |
| 13 | nova-scheduler | nova-scheduler-767b47d569-mcvth | internal | enabled | up | 2019-08-08T21:47:13.000000 |
| 16 | nova-consoleauth | nova-consoleauth-7cb854f756-6sktp | internal | enabled | up | 2019-08-08T21:47:13.000000 |
| 22 | nova-scheduler | nova-scheduler-767b47d569-58w5j | internal | enabled | up | 2019-08-08T21:47:15.000000 |
| 25 | nova-conductor | nova-conductor-5cdcd7d59b-82xzp | internal | enabled | up | 2019-08-08T21:47:13.000000 |
| 28 | nova-compute | compute-1 | nova | enabled | up | 2019-08-08T21:47:16.000000 |
| 31 | nova-compute | compute-0 | nova | enabled | up | 2019-08-08T21:47:15.000000 |
+----+------------------+-----------------------------------+----------+---------+-------+----------------------------+

Is this st...


Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

The original lab I reproduced this on was not available today.
This testcase was run on a different h/w lab (wcp63-66) this time, with a newer build, 2019-08-12_20-59-00.

The tests are passing (2 out of 2 attempts).
The only thing of note is the critical CPU alarms in step f.

After forced lock
a. the worker node becomes
3 | compute-1 | worker | locked | disabled | offline
Thu Aug 15 20:21:01 UTC 2019
hypervisor becomes
| 3 | compute-1 | worker | locked | disabled | offline

the instance reaches states: {'status': 'ERROR'}

b. the worker node then becomes online (briefly) then offline again
| 3 | compute-1 | worker | locked | disabled | online
...
| 3 | compute-1 | worker | unlocked | disabled | offline

c. On unlock, the worker node eventually moves to intest while the hypervisor is still down
| 3 | compute-1 | worker | unlocked | disabled | intest
Thu Aug 15 20:30:59 UTC 2019

$ nova hypervisor-list;date
+--------------------------------------+---------------------+-------+----------+
| ID | Hypervisor hostname | State | Status |
+--------------------------------------+---------------------+-------+----------+
| 31616106-5acb-41ae-bde1-fa7aee9d14c8 | compute-1 | down | disabled |
| 8aebd9e0-6dc7-4f03-b793-7a1a24d43c73 | compute-0 | up | enabled |
+--------------------------------------+---------------------+-------+----------+

Alarms observed as follows:
700.001 Instance tenant1-vm-1 owned by tenant1 has failed on host compute-1 tenant=79ad4133-8937-4154-87f0-23bd72c71412.instance=8a3202e0-9d80-4058-b993-fc9820aef3c6 critical 2019-08-15T20:20:50
200.006 compute-1 is degraded due to the failure of its 'pci-irq-affinity-agent' process. Auto recovery of this major process is in progress. host=compute-1.process=pci-irq-affinity-agent major 2019-08-15T20:31:24
750.004 Application Apply In Progress k8s_application=stx-openstack warning 2019-08-15T20:26:52
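
For completeness, the active alarms around the unlock can be listed with the FM CLI; a minimal sketch:

# all active alarms at the time of the check
fm alarm-list
# narrow down to the alarm IDs seen above
fm alarm-list | grep -E '200\.006|700\.001|750\.004'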

d. The worker node becomes degraded due to 200.006 and hypervisor state is down
| 3 | compute-1 | worker | unlocked | enabled | degraded
Thu Aug 15 20:32:25 UTC 2019
$ nova hypervisor-list;date

+--------------------------------------+---------------------+-------+---------+
| ID | Hypervisor hostname | State | Status |
+--------------------------------------+---------------------+-------+---------+
| 31616106-5acb-41ae-bde1-fa7aee9d14c8 | compute-1 | down | enabled |
| 8aebd9e0-6dc7-4f03-b793-7a1a24d43c73 | compute-0 | up | enabled |
+--------------------------------------+---------------------+-------+---------+

e. The stx-openstack application apply appears to complete and the worker becomes available (the hypervisor is enabled at this point)

| 3 | compute-1 | worker | unlocked | enabled | available |
Thu Aug 15 20:33:15 UTC 2019

$ nova hypervisor-list;date
+--------------------------------------+---------------------+-------+---------+
| ID | Hypervisor hostname | State | Status |
+--...


Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

If you want to mark the LP for retest, it should be retested on the original lab (WP_3-7).

Revision history for this message
Cindy Xie (xxie1) wrote :

Reproduced in the 20190822 master ISO sanity tests.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

As per agreement with the community, moving all unresolved medium priority bugs from stx.2.0 to stx.3.0

tags: added: stx.3.0
removed: stx.2.0
Revision history for this message
Frank Miller (sensfan22) wrote :

Erich, please update this LP with your current status. What are your plans to debug and address this LP?

Revision history for this message
Erich Cordoba (ericho) wrote :

In all the attempts I was unable to reproduce this issue, and a retest by the reporter was required.

I'll try to reproduce it again and will report the results here.

Revision history for this message
Peng Peng (ppeng) wrote :

The issue was not reproduced on Train.
Load: 2019-11-21_20-00-00
Lab: wcp_3-6

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Closing; as per the above, the issue is not reproducible on Train.

Changed in starlingx:
status: Triaged → Invalid
Yang Liu (yliu12)
tags: removed: stx.retestneeded