NoValidHost exception results in PENDING_CREATE stuck

Bug #2060099 reported by Maximilian Stinsky
This bug affects 1 person
Affects: octavia
Status: Invalid
Importance: Undecided
Assigned to: Unassigned

Bug Description

Greetings,

I found a problem in our lab environment that I can't explain and that looks like a bug to me.

We had a compute error which resulted in Octavia getting a "No valid host was found" exception from Nova.
But the newly created load balancers never went into provisioning_status ERROR; they stayed stuck in PENDING_CREATE. And the PENDING_* states can only be reset via manual DB interaction.

My assumption would be that a load balancer goes into provisioning_status ERROR when it can't spawn a VM, rather than getting stuck in a pending state.

The last task message from a worker for such an LB was this:
Task 'MASTER-octavia-create-amp-for-lb-subflow-octavia-compute-wait' (e68c8dff-42fe-429b-8bc7-c3fcff191054) transitioned into state 'FAILURE' from state 'RUNNING'

With the following error message:
octavia.common.exceptions.ComputeBuildException: Failed to build compute instance due to: {'code': 500, 'created': '2024-04-03T02:35:50Z', 'message': 'No valid host was found. There are not enough hosts available.', 'details': 'Traceback (most recent call last):\n File "/var/lib/kolla/venv/lib/python3.8/site-packages/nova/conductor/manager.py", line 1580, in schedule_and_build_instances\n host_lists = self._schedule_instances(context, request_specs[0],\n File "/var/lib/kolla/venv/lib/python3.8/site-packages/nova/conductor/manager.py", line 940, in _schedule_instances\n host_lists = self.query_client.select_destinations(\n File "/var/lib/kolla/venv/lib/python3.8/site-packages/nova/scheduler/client/query.py", line 41, in select_destinations\n return self.scheduler_rpcapi.select_destinations(context, spec_obj,\n File "/var/lib/kolla/venv/lib/python3.8/site-packages/osprofiler/profiler.py", line 160, in wrapper\n result = f(*args, **kwargs)\n File "/var/lib/kolla/venv/lib/python3.8/site-packages/nova/scheduler/rpcapi.py", line 160, in select_destinations\n return cctxt.call(ctxt, \'select_destinations\', **msg_args)\n File "/var/lib/kolla/venv/lib/python3.8/site-packages/oslo_messaging/rpc/client.py", line 189, in call\n result = self.transport._send(\n File "/var/lib/kolla/venv/lib/python3.8/site-packages/oslo_messaging/transport.py", line 123, in _send\n return self._driver.send(target, ctxt, message,\n File "/var/lib/kolla/venv/lib/python3.8/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 689, in send\n return self._send(target, ctxt, message, wait_for_reply, timeout,\n File "/var/lib/kolla/venv/lib/python3.8/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 681, in _send\n raise result\nnova.exception_Remote.NoValidHost_Remote: No valid host was found. 
There are not enough hosts available.\nTraceback (most recent call last):\n\n File "/var/lib/kolla/venv/lib/python3.8/site-packages/oslo_messaging/rpc/server.py", line 241, in inner\n return func(*args, **kwargs)\n\n File "/var/lib/kolla/venv/lib/python3.8/site-packages/nova/scheduler/manager.py", line 223, in select_destinations\n selections = self._select_destinations(\n\n File "/var/lib/kolla/venv/lib/python3.8/site-packages/nova/scheduler/manager.py", line 250, in _select_destinations\n selections = self._schedule(\n\n File "/var/lib/kolla/venv/lib/python3.8/site-packages/nova/scheduler/manager.py", line 416, in _schedule\n self._ensure_sufficient_hosts(\n\n File "/var/lib/kolla/venv/lib/python3.8/site-packages/nova/scheduler/manager.py", line 455, in _ensure_sufficient_hosts\n raise exception.NoValidHost(reason=reason)\n\nnova.exception.NoValidHost: No valid host was found. There are not enough hosts available.\n\n'}

So the worker sees it as a failure but does not seem to change the state?
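
For triage, the Nova fault embedded in an error message like the one above can be pulled out with a few lines of Python. This is only a minimal sketch: it assumes the whole exception text is available as a single string and that the part after "Failed to build compute instance due to: " is a Python dict literal, as it appears here. The helper name `extract_fault` and the shortened sample message (with an emptied 'details' field) are made up for illustration and are not part of Octavia.

```python
import ast

# Prefix as it appears in the ComputeBuildException message quoted in this report.
PREFIX = "Failed to build compute instance due to: "


def extract_fault(exc_text):
    """Return the fault dict embedded in the exception text, or None.

    Hypothetical helper: assumes the payload after PREFIX is a Python
    dict literal, which is how it shows up in the log above.
    """
    _, sep, payload = exc_text.partition(PREFIX)
    if not sep:
        return None
    try:
        fault = ast.literal_eval(payload.strip())
    except (ValueError, SyntaxError):
        return None
    return fault if isinstance(fault, dict) else None


# Shortened sample: the 'details' traceback is replaced by an empty string.
sample = (
    "octavia.common.exceptions.ComputeBuildException: "
    "Failed to build compute instance due to: "
    "{'code': 500, 'created': '2024-04-03T02:35:50Z', "
    "'message': 'No valid host was found. There are not enough hosts available.', "
    "'details': ''}"
)

fault = extract_fault(sample)
if fault is not None:
    print(fault["code"], fault["message"])
```

If the log file wraps the message across several lines, the text would need to be rejoined before parsing.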

As I have never seen this behaviour before, I suspect it may correlate with our upgrade to Yoga (10.1.0), but that's just a guess.

Thanks for any help in advance :)

Maximilian Stinsky (mstinsky) wrote :

After a bit more debugging I found that all the LBs stuck in PENDING_CREATE had one amphora stuck in `BOOTING`.

And if I trace that amphora to the Nova VM, the VM is in an error state with the fault message "No valid host was found".

So it seems to me that somewhere Octavia is not updating its state correctly for those VMs that are in an error state.

Gregory Thiemonge (gthiemonge) wrote :

Hi, thanks for reporting it,

It really looks like https://bugs.launchpad.net/octavia/+bug/2043360
which is a bug in taskflow https://bugs.launchpad.net/taskflow/+bug/2043808

Did the flow reach the REVERTED state? Something like:

WARNING octavia.controller.worker.v2.controller_worker [-] Flow 'octavia-create-loadbalancer-flow' (9c221fbb-d7c5-413b-9b17-3a6ca92e8d7a) transitioned into state 'REVERTED' from state 'RUNNING'

What was the last task executed by this flow? For example:

WARNING octavia.controller.worker.v2.controller_worker [-] Task 'BACKUP-octavia-create-amp-for-lb-subflow-octavia-compute-wait' (38a8143e-4816-4a16-ae05-5037003245ba) transitioned into state 'REVERTED' from state 'REVERTING' with result 'None'

If the last task is "*-compute-wait" and not "LoadBalancerToErrorOnRevertTask", it's the same bug.

Gregory Thiemonge (gthiemonge) wrote :

Sorry, the final task is not necessarily compute-wait; it could be anything.

So the phrase should be:

| if the last task is not "LoadBalancerToErrorOnRevertTask", it's the same bug
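
The check described above could be sketched as a small log scan. This is a rough sketch, not Octavia tooling: the regular expression is inferred only from the two sample log lines quoted in this thread, the function name `diagnose` is made up, and the input is assumed to contain only the worker log lines of one flow, in chronological order.

```python
import re

# Pattern inferred from the transition lines quoted in this thread (an assumption,
# not an official log format).
TRANSITION = re.compile(
    r"(?P<kind>Task|Flow) '(?P<name>[^']+)' \((?P<uuid>[0-9a-f-]+)\) "
    r"transitioned into state '(?P<state>[A-Z_]+)' from state '(?P<prev>[A-Z_]+)'"
)


def diagnose(log_lines, marker="LoadBalancerToErrorOnRevertTask"):
    """Summarise the revert of a single flow (hypothetical helper).

    Reports whether the flow reached REVERTED, which task was reverted last,
    and whether that last task is something other than the marker task.
    """
    flow_reverted = False
    last_reverted_task = None
    for line in log_lines:
        match = TRANSITION.search(line)
        if match is None:
            continue
        if match["state"] != "REVERTED":
            continue
        if match["kind"] == "Flow":
            flow_reverted = True
        else:
            # Later lines overwrite earlier ones, so this ends up as the last task.
            last_reverted_task = match["name"]
    likely_same_bug = (
        last_reverted_task is not None and marker not in last_reverted_task
    )
    return {
        "flow_reverted": flow_reverted,
        "last_reverted_task": last_reverted_task,
        "likely_same_bug": likely_same_bug,
    }
```

Calling `diagnose(open(path))` on a log file already filtered to one load balancer's flow (the filtering itself is left out here) returns a dict; `likely_same_bug` is true when the last reverted task is not "LoadBalancerToErrorOnRevertTask".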

Maximilian Stinsky (mstinsky) wrote :

Hi Gregory,

thanks for the quick answer.
No, I can't find any tasks with the name "LoadBalancerToErrorOnRevertTask".
So it does seem to be the same bug.

I think we can mark this as a duplicate then.

Gregory Thiemonge (gthiemonge) wrote :
Changed in octavia:
status: New → Invalid
Gregory Thiemonge (gthiemonge) wrote :

I have an idea for how to fix it in Octavia, as a kind of workaround for the taskflow bug. I hope I'll be able to work on it at the beginning of the D cycle (in the coming weeks).
