Brief Description
-----------------
Hypervisor remained down after a force lock followed by an unlock, although the maintenance (mtcAgent) state shows the host as enabled.
Severity
--------
Major
Steps to Reproduce
------------------
1. The test force locks the worker node (compute-0); a scripted version of steps 1-3 is sketched after the hypervisor table below.
| task       | Force Locking                    |
| updated_at | 2019-04-14T07:38:44.886194+00:00 |
2. Unlock the same node:
[2019-04-14 07:44:20,518] 'system host-unlock compute-0'
3. Confirm the host becomes enabled (after intest).
The task field updates through the following states:
| task       | Rebooting                        |
| updated_at | 2019-04-14T07:44:25.981913+00:00 |
| task       | Booting                          |
| updated_at | 2019-04-14T07:44:30.790748+00:00 |
| task       | Testing                          |
| updated_at | 2019-04-14T07:47:12.731347+00:00 |
| task       | Enabling                         |
| updated_at | 2019-04-14T07:47:57.846186+00:00 |
[2019-04-14 07:48:16,933] 'nova hypervisor-list' indicates that compute-0 is still unexpectedly down:
+--------------------------------------+---------------------+-------+---------+
| ID                                   | Hypervisor hostname | State | Status  |
+--------------------------------------+---------------------+-------+---------+
| 925b8617-8f67-4904-b1a3-1242b138f1d8 | compute-2           | up    | enabled |
| 88ca313d-980e-47a5-ac83-2e49e3d46bd5 | compute-0           | down  | enabled |
| b321e9cc-f9a9-4429-89d7-a88a4c947414 | compute-1           | up    | enabled |
+--------------------------------------+---------------------+-------+---------+
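
The reproduce steps can be scripted end to end. A minimal sketch, assuming an authenticated platform CLI session; the helper names, timeouts, and substring checks below are illustrative, not taken from the test suite:

#!/usr/bin/env python3
# Sketch only: automates steps 1-3 above against compute-0.
# Assumes platform/OpenStack credentials are already sourced;
# wait_for(), the timeouts, and the substring checks are illustrative.
import subprocess
import time

HOST = "compute-0"

def run(cmd):
    # Run a CLI command and return its stdout as text.
    return subprocess.run(cmd, shell=True, check=True,
                          capture_output=True, text=True).stdout

def wait_for(predicate, timeout=900, interval=10):
    # Poll until predicate() is truthy or the timeout expires.
    deadline = time.time() + timeout
    while time.time() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False

run(f"system host-lock --force {HOST}")                               # step 1
assert wait_for(lambda: "| locked" in run(f"system host-show {HOST}"))

run(f"system host-unlock {HOST}")                                     # step 2

# step 3: maintenance eventually reports the host enabled ...
assert wait_for(lambda: "| enabled" in run(f"system host-show {HOST}"))

# ... but the hypervisor is still listed as 'down' (the bug).
print(run("nova hypervisor-list"))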
mtcAgent.log reports that the worker node compute-0 is enabled:
2019-04-14T07:39:38.877 [107724.00536] controller-0 mtcAgent inv mtcInvApi.cpp ( 437) mtcInvApi_force_task : Info : compute-0 task clear (seq:209) (was:Force Lock Reset in Progress)
...
2019-04-14T07:44:25.643 fmAPI.cpp(471): Enqueue raise alarm request: UUID (2a196c37-efa9-40a4-ac3c-752ea02e3e93) alarm id (200.021) instant id (host=compute-0.command=unlock)
...
2019-04-14T07:47:12.586 [107724.00572] controller-0 mtcAgent inv mtcInvApi.cpp ( 334) mtcInvApi_update_task : Info : compute-0 Task: Testing (seq:222)
...
2019-04-14T07:47:57.739 [107724.00574] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (1193) enable_handler : Info : compute-0 got GOENABLED
2019-04-14T07:47:57.739 [107724.00575] controller-0 mtcAgent inv mtcInvApi.cpp ( 334) mtcInvApi_update_task : Info : compute-0 Task: Initializing (seq:224)
2019-04-14T07:47:57.744 [107724.00576] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (1234) enable_handler : Info : compute-0 Starting Host Services
2019-04-14T07:47:57.744 [107724.00577] controller-0 mtcAgent hbs nodeClass.cpp (7282) launch_host_services_cmd: Info : compute-0 start worker host services launch
2019-04-14T07:47:57.744 [107724.00578] controller-0 mtcAgent inv mtcInvApi.cpp ( 334) mtcInvApi_update_task : Info : compute-0 Task: Enabling (seq:225)
2019-04-14T07:47:57.744 [107724.00579] controller-0 mtcAgent msg mtcCtrlMsg.cpp (1140) send_hwmon_command : Info : compute-0 sending 'start' to hwmond service
2019-04-14T07:47:57.812 [107724.00580] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (2631) host_services_handler : Info : compute-0 start worker host services completed
2019-04-14T07:47:59.476 [107724.00581] controller-0 mtcAgent |-| nodeClass.cpp (4772) declare_service_ready : Info : compute-0 got hbsClient ready event
2019-04-14T07:47:59.476 [107724.00582] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (1362) enable_handler : Info : compute-0 Starting 11 sec Heartbeat Soak (with ready event)
2019-04-14T07:47:59.476 [107724.00583] controller-0 mtcAgent msg mtcCtrlMsg.cpp ( 784) send_hbs_command : Info : compute-0 sending 'start' to controller-0 heartbeat service
2019-04-14T07:47:59.476 [107724.00584] controller-0 mtcAgent msg mtcCtrlMsg.cpp ( 784) send_hbs_command : Info : compute-0 sending 'start' to controller-1 heartbeat service
2019-04-14T07:48:10.476 [107724.00585] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (1378) enable_handler : Info : compute-0 heartbeating
2019-04-14T07:48:10.481 [107724.00586] controller-0 mtcAgent inv mtcInvApi.cpp ( 437) mtcInvApi_force_task : Info : compute-0 task clear (seq:226) (was:Enabling)
2019-04-14T07:48:10.481 fmAPI.cpp(471): Enqueue raise alarm request: UUID (45172974-8359-4cbf-b7ed-9e82cdb0a3bb) alarm id (200.022) instant id (host=compute-0.state=enabled)
2019-04-14T07:48:10.481 [107724.00587] controller-0 mtcAgent hbs nodeClass.cpp (5785) allStateChange : Info : compute-0 unlocked-enabled-available (seq:227)
2019-04-14T07:48:10.490 fmAlarmUtils.cpp(524): Sending FM raise alarm request: alarm_id (200.022), entity_id (host=compute-0.state=enabled)
2019-04-14T07:48:10.531 fmAlarmUtils.cpp(558): FM Response for raise alarm: (0), alarm_id (200.022), entity_id (host=compute-0.state=enabled)
2019-04-14T07:48:10.648 [107724.00588] controller-0 mtcAgent alm mtcAlarm.cpp ( 396) mtcAlarm_clear : Info : compute-0 clearing 'Configuration' alarm (200.011)
2019-04-14T07:48:10.648 fmAPI.cpp(492): Enqueue clear alarm request: alarm id (200.011), instant id (host=compute-0)
2019-04-14T07:48:10.648 [107724.00589] controller-0 mtcAgent vim mtcVimApi.cpp ( 258) mtcVimApi_state_change : Info : compute-0 sending 'host' state change to vim (enabled)
2019-04-14T07:48:10.648 [107724.00590] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (1529) enable_handler : Info : compute-0 is ENABLED
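
For reference, the enable timeline above can be pulled out of mtcAgent.log directly. A minimal sketch; the event substrings match the excerpts above, while the log path and timestamp handling are assumptions:

#!/usr/bin/env python3
# Sketch only: print the compute-0 enable timeline from mtcAgent.log.
# The event substrings match the log excerpts above; the default
# log path is assumed.
EVENTS = ("mtcInvApi_update_task", "mtcInvApi_force_task",
          "enable_handler", "allStateChange")

with open("/var/log/mtcAgent.log") as log:
    for line in log:
        if "compute-0" in line and any(e in line for e in EVENTS):
            # Timestamp is the first whitespace-delimited token.
            print(line.split()[0], line.rstrip().split(": Info :")[-1])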
nfv-vim.log reports compute-0 exiting state (disabled) at 2019-04-14T07:48:10.659, but the hypervisor remained down:
2019-04-14T07:26:11.068 controller-0 VIM_Thread[1913] DEBUG _vim_nfvi_events.py.63 Host action, host_uuid=4fe67696-fbde-49c5-88b1-2872bb1bda55, host_name=compute-0, do_action=lock-force.
..
2019-04-14T07:48:10.654 controller-0 VIM_Thread[96769] DEBUG _vim_nfvi_events.py.98 Host state-change, nfvi_host_uuid=4fe67696-fbde-49c5-88b1-2872bb1bda55, nfvi_host_name=compute-0, nfvi_host_admin_state=unlocked, nfvi_host_oper_state=enabled, nfvi_host_avail_status=available.
2019-04-14T07:48:10.657 controller-0 VIM_Thread[96769] DEBUG _host.py.680 Host State-Change detected: nfvi_admin_state=unlocked host_admin_state=unlocked, nfvi_oper_state=enabled host_oper_state=disabled, nfvi_avail_state=available host_avail_status=intest, locking=False unlocking=False fsm current_state=disabled for compute-0.
2019-04-14T07:48:10.659 controller-0 VIM_Thread[96769] INFO _host_state_disabled.py.41 Exiting state (disabled) for compute-0.
2019-04-14T07:48:10.659 controller-0 VIM_Thread[96769] INFO _host_state_enabling.py.27 Entering state (enabling) for compute-0.
2019-04-14T07:48:10.661 controller-0 VIM_Thread[96769] INFO _host.py.839 Host compute-0 FSM State-Change: prev_state=disabled, state=enabling, event=enable.
2019-04-14T07:48:10.662 controller-0 VIM_Thread[96769] INFO _host_director.py.475 Host compute-0 state change notification.
2019-04-14T07:48:10.662 controller-0 VIM_Thread[96769] DEBUG nfvi_infrastructure_api.py.2266 Host rest-api patch path: /nfvi-plugins/v1/hosts/4fe67696-fbde-49c5-88b1-2872bb1bda55.
2019-04-14T07:48:10.844 controller-0 VIM_Infrastructure-Worker-0_Thread[97099] INFO kubernetes_client.py.103 Removing services:NoExecute taint from node compute-0
2019-04-14T07:48:22.263 controller-0 VIM_Thread[96769] INFO _vim_nfvi_audits.py.150 Ignoring audit reply for host compute-0
2019-04-14T07:48:26.411 controller-0 VIM_Thread[96769] INFO _hypervisor.py.133 Hypervisor compute-0 state change was locked-disabled now unlocked-disabled
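
The core mismatch here, maintenance reporting enabled while nova still lists the hypervisor as down, can be detected mechanically. A minimal sketch that parses the two CLI tables; the column handling assumes the default output layout shown above and is illustrative only:

#!/usr/bin/env python3
# Sketch only: flag the maintenance-vs-hypervisor state mismatch.
# Parses plain CLI tables rather than the REST APIs; column
# positions assume the default output layout shown above.
import subprocess

def cli(cmd):
    return subprocess.run(cmd, shell=True, check=True,
                          capture_output=True, text=True).stdout

# Maintenance (sysinv) view: the 'operational' row of host-show.
mtce_oper = None
for row in cli("system host-show compute-0").splitlines():
    if "| operational" in row:
        mtce_oper = [c.strip() for c in row.split("|")][2]

# Nova's view: the State column of hypervisor-list.
hyp_state = None
for row in cli("nova hypervisor-list").splitlines():
    if "compute-0" in row:
        # Columns: | ID | Hypervisor hostname | State | Status |
        hyp_state = [c.strip() for c in row.split("|")][3]

if mtce_oper == "enabled" and hyp_state == "down":
    print("MISMATCH: maintenance says enabled, hypervisor is down")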
Expected Behavior
------------------
After unlocking the worker node (following a force lock), the hypervisor is expected to become enabled, matching what maintenance reports.
Actual Behavior
----------------
nfv-vim.log reports compute-0 exiting state (disabled) at 2019-04-14T07:48:10.659, but 'nova hypervisor-list' continued to show the hypervisor as down.
Reproducibility
---------------
Yes
System Configuration
--------------------
2+3 (two controllers, three worker nodes)
Lab: WP_3-7
Test: nova/test_force_lock_with_vms.py::test_force_lock_with_non_mig_vms
Branch/Pull Time/Commit
-----------------------
20190410T013000Z
Timestamp/Logs
--------------
See inline above.
Note: the system in question has mismatched MTUs on the two controllers: controller-0 has an MTU of 9000 while controller-1 has an MTU of 1500. This can cause issues, though it is unclear whether it contributes to this one. Please fix the lab configuration issue and re-test.
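
As a quick way to verify the lab fix before re-testing, the interface MTUs on the two controllers can be compared directly. A minimal sketch; it assumes passwordless ssh between the controllers and reads MTUs with 'ip -o link', so it is illustrative rather than a lab tool:

#!/usr/bin/env python3
# Sketch only: flag MTU mismatches between the two controllers.
# Assumes passwordless ssh to both hosts; compares per interface name.
import subprocess

def mtus(host):
    # {interface: mtu} from 'ip -o link' (one interface per line).
    out = subprocess.run(["ssh", host, "ip", "-o", "link"],
                         check=True, capture_output=True, text=True).stdout
    result = {}
    for line in out.splitlines():
        fields = line.split()
        # Format: '<idx>: <name>: <flags> mtu <value> ...'
        result[fields[1].rstrip(":")] = int(fields[fields.index("mtu") + 1])
    return result

c0, c1 = mtus("controller-0"), mtus("controller-1")
for ifname in sorted(set(c0) & set(c1)):
    if c0[ifname] != c1[ifname]:
        print(f"{ifname}: controller-0 mtu={c0[ifname]}, "
              f"controller-1 mtu={c1[ifname]}")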