Greetings,
I found a problem in our lab environment that I can't explain and that looks like a bug to me.
We had a compute error that resulted in Octavia getting a "No valid host was found" exception from Nova.
But the newly created load balancers never went into provisioning_status ERROR; they were stuck in PENDING_CREATE. And the PENDING_* states can only be reset via manual DB intervention.
My assumption would be that a load balancer goes into provisioning_status ERROR when its VM cannot be spawned, rather than getting stuck in a pending state.
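For reference, the manual DB intervention I mean looks roughly like this (a sketch only; the `load_balancer` table and `provisioning_status` column names are what I see in our Octavia database, and the id is a placeholder):

```sql
-- Sketch: force a load balancer stuck in PENDING_CREATE into ERROR
-- so it can be deleted or failed over via the API again.
-- Run against the octavia database; '<lb-uuid>' is a placeholder.
UPDATE load_balancer
   SET provisioning_status = 'ERROR'
 WHERE id = '<lb-uuid>'
   AND provisioning_status = 'PENDING_CREATE';
```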
The last task message from a worker for such an LB was this:
Task 'MASTER-octavia-create-amp-for-lb-subflow-octavia-compute-wait' (e68c8dff-42fe-429b-8bc7-c3fcff191054) transitioned into state 'FAILURE' from state 'RUNNING'
With the following error message:
```
octavia.common.exceptions.ComputeBuildException: Failed to build compute instance due to:
{'code': 500, 'created': '2024-04-03T02:35:50Z',
 'message': 'No valid host was found. There are not enough hosts available.',
 'details':
Traceback (most recent call last):
  File "/var/lib/kolla/venv/lib/python3.8/site-packages/nova/conductor/manager.py", line 1580, in schedule_and_build_instances
    host_lists = self._schedule_instances(context, request_specs[0],
  File "/var/lib/kolla/venv/lib/python3.8/site-packages/nova/conductor/manager.py", line 940, in _schedule_instances
    host_lists = self.query_client.select_destinations(
  File "/var/lib/kolla/venv/lib/python3.8/site-packages/nova/scheduler/client/query.py", line 41, in select_destinations
    return self.scheduler_rpcapi.select_destinations(context, spec_obj,
  File "/var/lib/kolla/venv/lib/python3.8/site-packages/osprofiler/profiler.py", line 160, in wrapper
    result = f(*args, **kwargs)
  File "/var/lib/kolla/venv/lib/python3.8/site-packages/nova/scheduler/rpcapi.py", line 160, in select_destinations
    return cctxt.call(ctxt, 'select_destinations', **msg_args)
  File "/var/lib/kolla/venv/lib/python3.8/site-packages/oslo_messaging/rpc/client.py", line 189, in call
    result = self.transport._send(
  File "/var/lib/kolla/venv/lib/python3.8/site-packages/oslo_messaging/transport.py", line 123, in _send
    return self._driver.send(target, ctxt, message,
  File "/var/lib/kolla/venv/lib/python3.8/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 689, in send
    return self._send(target, ctxt, message, wait_for_reply, timeout,
  File "/var/lib/kolla/venv/lib/python3.8/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 681, in _send
    raise result
nova.exception_Remote.NoValidHost_Remote: No valid host was found. There are not enough hosts available.
Traceback (most recent call last):
  File "/var/lib/kolla/venv/lib/python3.8/site-packages/oslo_messaging/rpc/server.py", line 241, in inner
    return func(*args, **kwargs)
  File "/var/lib/kolla/venv/lib/python3.8/site-packages/nova/scheduler/manager.py", line 223, in select_destinations
    selections = self._select_destinations(
  File "/var/lib/kolla/venv/lib/python3.8/site-packages/nova/scheduler/manager.py", line 250, in _select_destinations
    selections = self._schedule(
  File "/var/lib/kolla/venv/lib/python3.8/site-packages/nova/scheduler/manager.py", line 416, in _schedule
    self._ensure_sufficient_hosts(
  File "/var/lib/kolla/venv/lib/python3.8/site-packages/nova/scheduler/manager.py", line 455, in _ensure_sufficient_hosts
    raise exception.NoValidHost(reason=reason)
nova.exception.NoValidHost: No valid host was found. There are not enough hosts available.
}
```
So the worker sees it as a failure but does not seem to change the state?
As I have never seen this behaviour before, I suspect it may correlate with our upgrade to Yoga (Octavia 10.1.0), but that's just a guess.
Thanks for any help in advance :)
After some further debugging I found that all the LBs stuck in PENDING_CREATE had one amphora stuck in `BOOTING`.
And if I trace such an amphora to its Nova VM, the VM is in state ERROR with the fault message "No valid host was found".
So it seems to me that Octavia is not correctly updating its own state for VMs that ended up in an error state.
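For anyone who wants to reproduce the check, this is roughly how I traced it with the openstack CLI (the ids are placeholders):

```shell
# List the amphorae of the affected load balancer; the stuck one shows
# status BOOTING and carries the Nova instance id in its compute_id field.
openstack loadbalancer amphora list --loadbalancer <lb-uuid>

# Inspect the backing Nova VM; for the stuck amphorae it is in ERROR
# with the "No valid host was found" fault.
openstack server show <compute-id> -c status -c fault
```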