compute node lock resulted in vim-services on controller unexpectedly going to active-failed

Bug #1824027 reported by Wendy Mitchell
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
StarlingX
Fix Released
High
Kevin Smith

Bug Description

Brief Description
-----------------
locking compute host unexpectedly fails vim-services on the active controller
(Appears to be an OpenStack Rest-API Exception with the L3/DHCP Agent rescheduling)

Severity
--------
standard

Steps to Reproduce
------------------
1. controller-0 and controller-1 enabled online state (controller-0 is the active controller)
worker nodes (compute-0, compute-1 and compute-2) all enabled and online
2. lock eg. compute-2 and check vim-services state

Expected Behavior
------------------
locking compute host should not fail the vim-services

Actual Behavior
----------------
vim service failed unexpectedly on the active controller when locking compute node

2019-04-09T20:44:51.113687 log 200.022 compute-2 is now 'disabled' host=compute-2.state=disabled not-applicable
...
2019-04-09T20:44:51.572559 set 200.001 compute-2 was administratively locked to take it out-of-service. host=compute-2 warning
...
2019-04-09T20:46:58.680853 log 401.001 Service group vim-services state change from active to active-failed on host controller-0; vim(disabled, failed) service_domain=controller.service_group=vim-services.host=controller-0 critical

vim-services disabled-failed active-failed

see var/log/sm-customer.log (from active controller-0)
| 2019-04-09T20:45:30.240 | 344 | service-scn | vim | enabled-active | disabled | process (pid=72780) failed
| 2019-04-09T20:45:30.706 | 345 | service-scn | vim | disabled | enabling | enabled-active state requested
| 2019-04-09T20:45:30.800 | 346 | service-scn | vim | enabling | enabled-active | enable success
| 2019-04-09T20:46:58.313 | 347 | service-scn | vim | enabled-active | disabled | process (pid=221031) failed
| 2019-04-09T20:46:58.680 | 348 | service-scn | vim | disabled | disabled-failed | enabled-active state requested
| 2019-04-09T20:46:58.680 | 349 | service-group-scn | vim-services | active | active-failed | vim(disabled, failed)
| 2019-04-09T20:46:59.687 | 350 | service-group-scn | vim-services | active-failed | disabling-failed | vim(disabled, failed)
| 2019-04-09T20:46:59.688 | 351 | service-group-scn | cloud-services | active | disabling |
| 2019-04-09T20:46:59.689 | 352 | service-group-scn | controller-services | active | disabling |
| 2019-04-09T20:46:59.690 | 353 | service-group-scn | oam-services | active | disabling |

Reproducibility
---------------
yes

System Configuration
--------------------
2+3
(Lab yow-cgcs-wolfpass-8-12)

Branch/Pull Time/Commit
--------------------
BUILD_ID="20190408T233001Z"

Last Pass

---------

Timestamp/Logs
--------------

Issue appears to be with the L3/DHCP Agent rescheduling.
see /var/log/nfv-vim.log (attached) on controller-0 from the time of the lock

2019-04-09T20:44:09.122 controller-0 VIM_Thread[72780] INFO _network_rebalance.py.1065 Triggering L3 Agent reschedule for disabled l3 agent host: compute-2
2019-04-09T20:44:09.122 controller-0 VIM_Thread[72780] INFO _dhcp_rebalance.py.866 Triggering DHCP Agent reschedule for disabled dhcp agent host: compute-2
2019-04-09T20:44:15.477 controller-0 VIM_Network-Worker-0_Thread[160705] INFO _timer_scheduler.py.57 Not scheduling on time, elapsed=122412 ms.
2019-04-09T20:44:15.478 controller-0 VIM_Network-Worker-0_Thread[160705] INFO _thread.py.218 Thread Network-Worker-0: not scheduling on time
2019-04-09T20:44:15.478 controller-0 VIM_Network-Worker-0_Thread[160705] INFO _thread.py.218 Thread Network-Worker-0: not scheduling on time
2019-04-09T20:44:15.478 controller-0 VIM_Network-Worker-0_Thread[160705] INFO _thread.py.232 Thread Network-Worker-0: shutting down.
2019-04-09T20:44:15.478 controller-0 VIM_Network-Worker-0_Thread[160705] INFO _thread.py.236 Thread Network-Worker-0: shutdown.
2019-04-09T20:44:18.578 controller-0 VIM_Thread[72780] ERROR Caught exception while trying to add network to agent, error=[OpenStack Rest-API Exception: method=POST, url=http://neutron-server.openstack.svc.cluster.local:9696//v2.0/agents/b8a4a817-3f90-41b7-b846-e635b5304cac/dhcp-networks, headers={'Content-Type': 'application/json'}, body={"network_id": "3a0da180-b80c-4d94-b7cb-99292ff469f6"}, status_code=404, reason=HTTP Error 404: Not Found, response_headers=[('Date', 'Tue, 09 Apr 2019 20:44:18 GMT'), ('Content-Length', '165'), ('Content-Type', 'application/json'), ('X-Openstack-Request-Id', 'req-8b7b7cad-6918-40e4-bdf5-3c51bb94cbf0')], response_body={"NeutronError": {"message": "Agent b8a4a817-3f90-41b7-b846-e635b5304cac is not a valid DHCP Agent or has been disabled.", "type": "InvalidDHCPAgent", "detail": ""}}].
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/nfv_plugins/nfvi_plugins/nfvi_network_api.py", line 1518, in add_network_to_dhcp_agent
    future.result = (yield)
OpenStackRestAPIException: [OpenStack Rest-API Exception: method=POST, url=http://neutron-server.openstack.svc.cluster.local:9696//v2.0/agents/b8a4a817-3f90-41b7-b846-e635b5304cac/dhcp-networks, headers={'Content-Type': 'application/json'}, body={"network_id": "3a0da180-b80c-4d94-b7cb-99292ff469f6"}, status_code=404, reason=HTTP Error 404: Not Found, response_headers=[('Date', 'Tue, 09 Apr 2019 20:44:18 GMT'), ('Content-Length', '165'), ('Content-Type', 'application/json'), ('X-Openstack-Request-Id', 'req-8b7b7cad-6918-40e4-bdf5-3c51bb94cbf0')], response_body={"NeutronError": {"message": "Agent b8a4a817-3f90-41b7-b846-e635b5304cac is not a valid DHCP Agent or has been disabled.", "type": "InvalidDHCPAgent", "detail": ""}}]
2019-04-09T20:44:18.579 controller-0 VIM_Thread[72780] WARNING _dhcp_rebalance.py.390 Unable to add network to dhcp agent, response = {'completed': False, 'reason': '', 'result-data': ''}

also controller-1 nfv-vim.log

2019-04-09T20:43:57.778 controller-0 VIM_Thread[72780] INFO _host_state_disabling.py.27 Entering state (disabling) for compute-2.

2019-04-09T20:43:57.782 controller-0 VIM_Thread[72780] INFO _host.py.839 Host compute-2 FSM State-Change: prev_state=enabled, state=disabling, event=task-completed.

2019-04-09T20:43:57.784 controller-0 VIM_Thread[72780] INFO _host_director.py.475 Host compute-2 state change notification.

2019-04-09T20:43:57.811 controller-0 VIM_Event-Log_Thread[72818] INFO fm.py.379 Generated customer log, fm_uuid=6c380c47-ff5d-4e34-80ce-7e4977e51f21.
2019-04-09T20:43:57.892 controller-0 VIM_Thread[72780] INFO _hypervisor.py.133 Hypervisor compute-2 state change was unlocked-enabled now locked-enabled

2019-04-09T20:43:57.892 controller-0 VIM_Thread[72780] INFO _host_director.py.395 Notify other directors that the host compute-2 services are disabling.

2019-04-09T20:43:57.892 controller-0 VIM_Thread[72780] INFO _instance_director.py.1339 Host compute-2 services disabling.

2019-04-09T20:43:57.935 controller-0 VIM_Event-Log_Thread[72818] INFO fm.py.379 Generated customer log, fm_uuid=f7f39587-6445-413b-8fb8-0099f475f361.

2019-04-09T20:43:58.033 controller-0 VIM_Thread[72780] INFO _host_director.py.408 Notify other directors that the host compute-2 services are disabled.

2019-04-09T20:43:58.034 controller-0 VIM_Thread[72780] INFO _instance_director.py.1392 Host compute-2 services disabled.

2019-04-09T20:43:58.034 controller-0 VIM_Thread[72780] DEBUG _instance_director.py.1408 Canceling previous host operation host-disable, before continuing with host operation host-disable for compute-2.
2019-04-09T20:43:58.035 controller-0 VIM_Thread[72780] INFO _host_task_work.py.194 Queueing rebalance for host compute-2 disable

2019-04-09T20:43:58.119 controller-0 VIM_Thread[72780] INFO _hypervisor.py.133 Hypervisor compute-2 state change was locked-enabled now locked-disabled

2019-04-09T20:43:58.120 controller-0 VIM_Thread[72780] INFO _host_director.py.421 Notify other directors that the host compute-2 is disabled.
2019-04-09T20:43:58.120 controller-0 VIM_Thread[72780] INFO _instance_director.py.1427 Host compute-2 disabled.
2019-04-09T20:43:58.120 controller-0 VIM_Thread[72780] DEBUG _instance_director.py.1432 Canceling host operation host-disable for compute-2.
2019-04-09T20:43:58.121 controller-0 VIM_Thread[72780] DEBUG _host_tasks.py.287 Task (disable-host_compute-2) complete.

2019-04-09T20:43:58.121 controller-0 VIM_Thread[72780] INFO _host_state_disabling.py.36 Exiting state (disabling) for compute-2.
2019-04-09T20:43:58.121 controller-0 VIM_Thread[72780] INFO _host_state_disabled.py.29 Entering state (disabled) for compute-2.
2019-04-09T20:43:58.123 controller-0 VIM_Thread[72780] INFO _host.py.839 Host compute-2 FSM State-Change: prev_state=disabling, state=disabled, event=task-completed.

2019-04-09T20:43:58.125 controller-0 VIM_Thread[72780] INFO _host_director.py.475 Host compute-2 state change notification.
2019-04-09T20:43:58.161 controller-0 VIM_Event-Log_Thread[72818] INFO fm.py.379 Generated customer log, fm_uuid=14df24df-39a7-4770-bff4-34a51c7b4f90.
2019-04-09T20:44:07.959 controller-0 VIM_Thread[72780] INFO _vim_nfvi_audits.py.873 Audit instances called, timer_id=15.
2019-04-09T20:44:09.122 controller-0 VIM_Thread[72780] INFO _network_rebalance.py.1065 Triggering L3 Agent reschedule for disabled l3 agent host: compute-2

2019-04-09T20:44:09.122 controller-0 VIM_Thread[72780] INFO _dhcp_rebalance.py.866 Triggering DHCP Agent reschedule for disabled dhcp agent host: compute-2
2019-04-09T20:44:15.477 controller-0 VIM_Network-Worker-0_Thread[160705] INFO _timer_scheduler.py.57 Not scheduling on time, elapsed=122412 ms.
2019-04-09T20:44:15.478 controller-0 VIM_Network-Worker-0_Thread[160705] INFO _thread.py.218 Thread Network-Worker-0: not scheduling on time

2019-04-09T20:44:15.478 controller-0 VIM_Network-Worker-0_Thread[160705] INFO _thread.py.218 Thread Network-Worker-0: not scheduling on time

2019-04-09T20:44:15.478 controller-0 VIM_Network-Worker-0_Thread[160705] INFO _thread.py.232 Thread Network-Worker-0: shutting down.

2019-04-09T20:44:15.478 controller-0 VIM_Network-Worker-0_Thread[160705] INFO _thread.py.236 Thread Network-Worker-0: shutdown.
2019-04-09T20:44:18.578 controller-0 VIM_Thread[72780] ERROR Caught exception while trying to add network to agent, error=[OpenStack Rest-API Exception: method=POST, url=http://neutron-server.openstack.svc.cluster.local:9696//v2.0/agents/b8a4a817-3f90-41b7-b846-e635b5304cac/dhcp-networks, headers={'Content-Type': 'application/json'}, body={"network_id": "3a0da180-b80c-4d94-b7cb-99292ff469f6"}, status_code=404, reason=HTTP Error 404: Not Found, response_headers=[('Date', 'Tue, 09 Apr 2019 20:44:18 GMT'), ('Content-Length', '165'), ('Content-Type', 'application/json'), ('X-Openstack-Request-Id', 'req-8b7b7cad-6918-40e4-bdf5-3c51bb94cbf0')], response_body={"NeutronError": {"message": "Agent b8a4a817-3f90-41b7-b846-e635b5304cac is not a valid DHCP Agent or has been disabled.", "type": "InvalidDHCPAgent", "detail": ""}}].
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/nfv_plugins/nfvi_plugins/nfvi_network_api.py", line 1518, in add_network_to_dhcp_agent
    future.result = (yield)
OpenStackRestAPIException: [OpenStack Rest-API Exception: method=POST, url=http://neutron-server.openstack.svc.cluster.local:9696//v2.0/agents/b8a4a817-3f90-41b7-b846-e635b5304cac/dhcp-networks, headers={'Content-Type': 'application/json'}, body={"network_id": "3a0da180-b80c-4d94-b7cb-99292ff469f6"}, status_code=404, reason=HTTP Error 404: Not Found, response_headers=[('Date', 'Tue, 09 Apr 2019 20:44:18 GMT'), ('Content-Length', '165'), ('Content-Type', 'application/json'), ('X-Openstack-Request-Id', 'req-8b7b7cad-6918-40e4-bdf5-3c51bb94cbf0')], response_body={"NeutronError": {"message": "Agent b8a4a817-3f90-41b7-b846-e635b5304cac is not a valid DHCP Agent or has been disabled.", "type": "InvalidDHCPAgent", "detail": ""}}]
2019-04-09T20:44:18.579 controller-0 VIM_Thread[72780] WARNING _dhcp_rebalance.py.390 Unable to add network to dhcp agent, response = {'completed': False, 'reason': '', 'result-data': ''}

2019-04-09T20:44:21.383 controller-0 VIM_Thread[72780] ERROR Caught exception while trying to add network to agent, error=[OpenStack Rest-API Exception: method=POST, url=http://neutron-server.openstack.svc.cluster.local:9696//v2.0/agents/b8a4a817-3f90-41b7-b846-e635b5304cac/dhcp-networks, headers={'Content-Type': 'application/json'}, body={"network_id": "a8e24402-829a-4002-9824-dc944f0d6480"}, status_code=404, reason=HTTP Error 404: Not Found, response_headers=[('Date', 'Tue, 09 Apr 2019 20:44:21 GMT'), ('Content-Length', '165'), ('Content-Type', 'application/json'), ('X-Openstack-Request-Id', 'req-ba08279b-9712-4213-b952-ad937fab11c9')], response_body={"NeutronError": {"message": "Agent b8a4a817-3f90-41b7-b846-e635b5304cac is not a valid DHCP Agent or has been disabled.", "type": "InvalidDHCPAgent", "detail": ""}}].
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/nfv_plugins/nfvi_plugins/nfvi_network_api.py", line 1518, in add_network_to_dhcp_agent
    future.result = (yield)
OpenStackRestAPIException: [OpenStack Rest-API Exception: method=POST, url=http://neutron-server.openstack.svc.cluster.local:9696//v2.0/agents/b8a4a817-3f90-41b7-b846-e635b5304cac/dhcp-networks, headers={'Content-Type': 'application/json'}, body={"network_id": "a8e24402-829a-4002-9824-dc944f0d6480"}, status_code=404, reason=HTTP Error 404: Not Found, response_headers=[('Date', 'Tue, 09 Apr 2019 20:44:21 GMT'), ('Content-Length', '165'), ('Content-Type', 'application/json'), ('X-Openstack-Request-Id', 'req-ba08279b-9712-4213-b952-ad937fab11c9')], response_body={"NeutronError": {"message": "Agent b8a4a817-3f90-41b7-b846-e635b5304cac is not a valid DHCP Agent or has been disabled.", "type": "InvalidDHCPAgent", "detail": ""}}]

2019-04-09T20:44:21.383 controller-0 VIM_Thread[72780] WARNING _dhcp_rebalance.py.390 Unable to add network to dhcp agent, response = {'completed': False, 'reason': '', 'result-data': ''}

2019-04-09T20:44:24.963 controller-0 VIM_Thread[72780] ERROR Caught exception while trying to add router to agent, error=[OpenStack Rest-API Exception: method=POST, url=http://neutron-server.openstack.svc.cluster.local:9696//v2.0/agents/0342dcea-faee-4e13-beb3-41127ac33895/l3-routers, headers={'Content-Type': 'application/json'}, body={"router_id": "48fc7a78-e597-447d-9028-89c98c3f793a"}, status_code=404, reason=HTTP Error 404: Not Found, response_headers=[('Date', 'Tue, 09 Apr 2019 20:44:24 GMT'), ('Content-Length', '154'), ('Content-Type', 'application/json'), ('X-Openstack-Request-Id', 'req-1b5f2570-dfdb-45c6-838e-0399c744aab2')], response_body={"NeutronError": {"message": "Agent 0342dcea-faee-4e13-beb3-41127ac33895 is not a L3 Agent or has been disabled", "type": "InvalidL3Agent", "detail": ""}}].
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/nfv_plugins/nfvi_plugins/nfvi_network_api.py", line 1444, in add_router_to_agent
    future.result = (yield)
OpenStackRestAPIException: [OpenStack Rest-API Exception: method=POST, url=http://neutron-server.openstack.svc.cluster.local:9696//v2.0/agents/0342dcea-faee-4e13-beb3-41127ac33895/l3-routers, headers={'Content-Type': 'application/json'}, body={"router_id": "48fc7a78-e597-447d-9028-89c98c3f793a"}, status_code=404, reason=HTTP Error 404: Not Found, response_headers=[('Date', 'Tue, 09 Apr 2019 20:44:24 GMT'), ('Content-Length', '154'), ('Content-Type', 'application/json'), ('X-Openstack-Request-Id', 'req-1b5f2570-dfdb-45c6-838e-0399c744aab2')], response_body={"NeutronError": {"message": "Agent 0342dcea-faee-4e13-beb3-41127ac33895 is not a L3 Agent or has been disabled", "type": "InvalidL3Agent", "detail": ""}}]
2019-04-09T20:44:24.963 controller-0 VIM_Thread[72780] WARNING _network_rebalance.py.486 Unable to add router to l3 agent, response = {'completed': False, 'reason': '', 'result-data': ''}
2019-04-09T20:44:50.707 controller-0 VIM_Thread[72780] DEBUG _vim_nfvi_events.py.63 Host action, host_uuid=4fe5d32b-37e5-4a6c-938c-2f63df0d6dbc, host_name=compute-2, do_action=lock.\
...

2019-04-09T20:45:15.651 controller-0 VIM_Thread[72780] INFO _network_rebalance.py.1065 Triggering L3 Agent reschedule for disabled l3 agent host: compute-1
2019-04-09T20:45:15.652 controller-0 VIM_Thread[72780] INFO _dhcp_rebalance.py.866 Triggering DHCP Agent reschedule for disabled dhcp agent host: compute-1

2019-04-09T20:45:20.138 controller-0 VIM_Thread[72780] ERROR list index out of range
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/nfv_vim/vim.py", line 168, in process_main
    timers.timers_schedule()
  File "/usr/lib64/python2.7/site-packages/nfv_common/timers/_timer_module.py", line 88, in timers_schedule
    _scheduler.schedule()
  File "/usr/lib64/python2.7/site-packages/nfv_common/timers/_timer_scheduler.py", line 75, in schedule
    rearm = timer.callback(now_ms)
  File "/usr/lib64/python2.7/site-packages/nfv_common/timers/_timer.py", line 68, in callback
    self._callback.send(self._timer_id)
  File "/usr/lib64/python2.7/site-packages/nfv_vim/network_rebalance/_dhcp_rebalance.py", line 807, in _dr_timer
    _run_state_machine()
  File "/usr/lib64/python2.7/site-packages/nfv_vim/network_rebalance/_dhcp_rebalance.py", line 829, in _run_state_machine
    _get_networks_on_agents()
  File "/usr/lib64/python2.7/site-packages/nfv_vim/network_rebalance/_dhcp_rebalance.py", line 537, in _get_networks_on_agents
    agent_id, host_name = _DHCPRebalance.get_current_dhcp_agent()
  File "/usr/lib64/python2.7/site-packages/nfv_vim/network_rebalance/_dhcp_rebalance.py", line 111, in get_current_dhcp_agent
    agent_id = self.dhcp_agents[self.dhcpagent_idx]['id']
IndexError: list index out of range

Test Activity
-------------
[Platform CPU testing]

Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :
Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :
summary: - compute node lock resulted in vim-services on controller going to
- active-failed
+ compute node lock resulted in vim-services on controller unexpectedly
+ going to active-failed
Changed in starlingx:
assignee: nobody → Kevin Smith (kevin.smith.wrs)
Revision history for this message
Kevin Smith (kevin.smith.wrs) wrote :

Identical bugs in populate_dhcp_agents in _dhcp_rebalance.py and populate_l3agents in _network_rebalance.py, self.add_agent() should only be called for a down agent (if clause) or alive and admin_state_up agents (elif clause). Testing fix.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as release gating; high priority given the locks fail. Recently introduced by the neutron agent rescheduling feature in the vim

Changed in starlingx:
importance: Undecided → High
status: New → Triaged
tags: added: stx.2.0 stx.nfv
tags: added: stx.retestneeded
Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-nfv (master)

Reviewed: https://review.openstack.org/651754
Committed: https://git.openstack.org/cgit/openstack/stx-nfv/commit/?id=f6736a4b0ccbd32ef4552fef5e27273c07ca5e94
Submitter: Zuul
Branch: master

commit f6736a4b0ccbd32ef4552fef5e27273c07ca5e94
Author: Kevin Smith <email address hidden>
Date: Thu Apr 11 06:45:28 2019 -0400

    Network/Router agent rebalancing bug fix.

    Add fix to ignore down agents in rescheduling.

    Change-Id: I5d06bd489f2d67be996e55608b77e31c1ed85114
    Closes-Bug: #1824027
    Signed-off-by: Kevin Smith <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
tags: removed: stx.retestneeded
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.