Frequent instances stuck in BUILD with no apparent failure

Bug #1854992 reported by Erik Olof Gunnar Andersson
This bug affects 3 people
Affects: OpenStack Compute (nova)
Status: Incomplete
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

We are getting frequent instances stuck indefinitely in BUILD without an error message. This seems to be triggered by high concurrency (e.g. building a lot of instances with Terraform).

We have multiple synthetic instances that are being built and destroyed every 10 minutes, and they never hit this issue.

This is running one commit behind the latest stable/rocky branch.

Tags: rpc
Revision history for this message
Erik Olof Gunnar Andersson (eandersson) wrote :

Right now we suspect that this is caused by issues with RabbitMQ. If a message for some reason isn't delivered to the compute (e.g. broken bindings), the instance will never be built and will be stuck indefinitely in BUILD.

Even worse, there is nothing in the logs, and there is nothing in the API that would indicate which compute has issues. It would be helpful to record the target compute before the message is sent to it, or maybe change the RabbitMQ message to an RPC call, so that it would time out.

Another idea would be to make the message send mandatory (i.e. set RabbitMQ's mandatory flag).

Revision history for this message
Matt Riedemann (mriedem) wrote :

Are there errors in the rabbitmq logs?

This is really a needle in the haystack kind of bug unless you can narrow down specifically where something is going wrong.

The compute is making DB queries/updates over RPC to conductor and those are synchronous. Is something failing or timing out there? Because I think at some point we've talked about making those object indirection API calls use the long_rpc_timeout.

Otherwise what RPC calls are you suggesting be changed from cast to call? Something from conductor to the compute? Because I'm not sure how that would help. During a build, we have:

* api casts to conductor
* conductor calls scheduler for a list of hosts
* conductor casts to the first host and passes the alternates
* if that host fails, compute casts to the cell conductor with the remaining alternates
* cell conductor checks if there are any remaining alternates to use and if so, casts to the next alternate compute - this process continues until we get a build or all alternates are exhausted at which point conductor sets the instance to ERROR status and raises MaxRetriesExceeded which ends the process

Note that the conductor<>compute interaction there is per instance, IOW a list of instances in a multi-create request are not sent to the same selected compute host. Any reschedules to alternate hosts are per-instance.
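
For reference, here is a rough sketch of the cast vs. call difference with oslo.messaging (this is not nova code; the topic, server and method names are only illustrative). A cast is fire-and-forget, so a message lost to a broken binding disappears silently, while a call at least surfaces as a timeout:

    import oslo_messaging as messaging
    from oslo_config import cfg

    transport = messaging.get_rpc_transport(cfg.CONF)  # uses [DEFAULT] transport_url
    target = messaging.Target(topic='compute', server='numa-1')
    client = messaging.RPCClient(transport, target, timeout=60)
    ctxt = {}  # a real caller passes a request context

    # cast: fire-and-forget.  If the compute's topic queue has lost its
    # binding, the message is dropped and the caller never notices.
    client.cast(ctxt, 'build_and_run_instance', instance_uuid='...')

    # call: waits for a reply, so a lost message at least shows up as a timeout.
    try:
        client.call(ctxt, 'get_console_output', instance_uuid='...')
    except messaging.MessagingTimeout:
        pass  # here we would at least know something is wrong with the queue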

I'm going to mark this as incomplete since I think this needs quite a bit more debug/investigation from your end reproducing it to see where something helpful could be done.

tags: added: rpc
Changed in nova:
status: New → Incomplete
Revision history for this message
Erik Olof Gunnar Andersson (eandersson) wrote :

I think the biggest thing here is that if for some reason the scheduling message to the compute is lost in transit, the call fails silently and the instance gets stuck indefinitely in BUILD. This is a huge toll on users (tools like Terraform do not handle this well) and operators, because there are no logs and the target computes are not listed in any API commands.

We have encountered multiple RabbitMQ related bugs causing the above, and while the core of the issue is related to RabbitMQ, it is made much worse because there is nothing on the Nova side to indicate an error (e.g. BUILD timeout, RPC timeout) and no logs to provide a hint at where something went wrong (or rather, which compute).

The issue on the RabbitMQ side happens when a RabbitMQ node either crashes or is hit by a network partition and does not recover properly. The last bug we hit was this one (you can see a comment from someone else running OpenStack as well):
https://github.com/rabbitmq/rabbitmq-server/issues/641

I would suggest that at the very least we add a log line before the message is sent, e.g.
> Sending scheduler event <bla> to compute y

Looking into making these casts RPC calls would help as well, as it would at least allow failed instances to go into ERROR.
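
Something along these lines (just a sketch; the helper name and call site are hypothetical, not existing nova code):

    import logging

    LOG = logging.getLogger(__name__)

    def cast_build_to_compute(client, ctxt, instance, host):
        # Log the target compute before the fire-and-forget cast, so an
        # operator can at least tell which compute never picked up the build.
        LOG.info('Sending build request for instance %s to compute %s',
                 instance['uuid'], host)
        cctxt = client.prepare(server=host)  # pins the cast to that compute's queue
        cctxt.cast(ctxt, 'build_and_run_instance', instance=instance)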

Revision history for this message
melanie witt (melwitt) wrote :

This came up again today in #openstack-nova [1] where sean-k-mooney and I were discussing how we could leverage the new(-ish) 'mandatory' oslo.messaging option [2] in nova to recover from a rabbitmq queue being gone.

What (we think) we understand so far is that we could specify the mandatory=True flag to the RPCClient at construction time; that would cause it to raise MessageUndeliverable if the rabbit queue has gone away, and then we'd have to add a try-except to handle MessageUndeliverable and create the queue in that case.

What we don't know yet is if we could put that try-except in a tidy, central place in nova/rpc.py and thus have any component that hits MessageUndeliverable create the queue -- that is, if nova-conductor gets MessageUndeliverable, can it create the queue, and if it does, will nova-compute be able to listen on it without having to re-init? Or will nova-compute need to re-init?
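
Roughly the shape of the handling we're talking about (only a sketch; what to actually do in the except block -- recreate the queue, retry, or error out the instance -- is exactly the open question):

    from oslo_messaging import exceptions as om_exc

    def cast_or_call(cctxt, ctxt, method, do_cast=True, **kwargs):
        # Hypothetical central helper (e.g. in nova/rpc.py) wrapping the send
        # so MessageUndeliverable is handled in one place.
        try:
            if do_cast:
                return cctxt.cast(ctxt, method, **kwargs)
            return cctxt.call(ctxt, method, **kwargs)
        except om_exc.MessageUndeliverable:
            # The target queue is missing or unroutable.  Possible reactions:
            # re-create/re-bind the queue, retry once, or surface the failure
            # so the instance goes to ERROR instead of hanging in BUILD.
            raise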

Sean might be able to experiment a bit with a setup to test some ideas out and determine what's the best way we can handle queue re-creates + listening.

[1] http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2020-06-04.log.html#t2020-06-04T17:27:23
[2] https://github.com/Gsantomaggio/rabbitmq-utils/tree/master/openstack/mandatory_test

Revision history for this message
melanie witt (melwitt) wrote :

Update: I have found that apparently oslo.messaging began setting the mandatory flag to True for rabbitmq by default, starting in version 10.2.0 (train) [1]:

https://github.com/openstack/oslo.messaging/commit/b7e9faf6590086b79f9696183b5c296ecc12b7b6

So, that means we're already getting MessageUndeliverable raised, and what's potentially left to do is to add handling of the exception to nova.

[1] https://releases.openstack.org/train/index.html#train-oslo-messaging

Revision history for this message
sean mooney (sean-k-mooney) wrote :

By the way, I deployed master (Victoria) today
and booted a VM, then proceeded to delete the compute.numa-1 queue (the topic queue for that compute service). The behavior I am seeing is that the queue is automatically recreated,
so when I tried to trigger an RPC by getting the console log, the queue already existed.

I also tried to force the issue by deleting the queue and then restarting RabbitMQ.
That similarly resulted in the queue being recreated, as did deleting the queue and the nova exchange.

So either this has been fixed on master or we do not have a way to reproduce it.

Revision history for this message
sean mooney (sean-k-mooney) wrote :

It looks like deleting the queue causes the following exception and reconnect:

Aug 12 13:14:24 numa-1 nova-compute[208013]: ERROR oslo.messaging._drivers.impl_rabbit [-] [73af055b-09f6-49b0-a296-3e8922570706] AMQP server on 172.20.4.244:5672 is unreachable: Basic.cancel: (0) 2. Trying again in 1 seconds.: amqp.exceptions.ConsumerCancelled: Basic.cancel: (0) 2
Aug 12 13:14:25 numa-1 nova-compute[208013]: INFO oslo.messaging._drivers.impl_rabbit [-] [73af055b-09f6-49b0-a296-3e8922570706] Reconnected to AMQP server on 172.20.4.244:5672 via [amqp] client with port 48114.

which fixes the queue.

So this is looking more and more like a RabbitMQ issue where the queue actually still exists but messages are no longer delivered to it.

To simulate that, I unbound the compute.numa-1 queue from the topic exchange,
so it still exists but without a topic routing key.

In this configuration I see the RPCs are basically lost,

and the agent does not self-heal.

There are no errors in the n-cpu log and the compute service is still up, as it is able to send messages but not receive them.

Restarting the agent also fixes the issue, as it resubscribes the queue to the correct topic.

So I think that is what is actually happening:

either RabbitMQ is losing track of the topic or of the consumer of the topic, as suggested in
https://github.com/rabbitmq/rabbitmq-server/issues/641
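
For reference, roughly how the broken-binding state can be reproduced via the RabbitMQ management HTTP API (the host is from my test setup above; port 15672, guest/guest credentials and the default vhost are assumptions, adjust to your environment):

    import requests

    RABBIT = 'http://172.20.4.244:15672'
    AUTH = ('guest', 'guest')
    VHOST = '%2F'              # URL-encoded default vhost "/"
    QUEUE = 'compute.numa-1'   # the compute service's topic queue
    EXCHANGE = 'nova'          # nova's control (topic) exchange

    # Find the binding that carries the topic routing key ...
    bindings = requests.get(
        f'{RABBIT}/api/queues/{VHOST}/{QUEUE}/bindings', auth=AUTH).json()

    # ... and delete it.  The queue keeps existing, but casts to it are now
    # silently dropped, which matches the "stuck in BUILD" behavior.
    for b in bindings:
        if b['source'] == EXCHANGE and b['routing_key'] == QUEUE:
            requests.delete(
                f"{RABBIT}/api/bindings/{VHOST}/e/{EXCHANGE}/q/{QUEUE}/{b['properties_key']}",
                auth=AUTH)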

Revision history for this message
sean mooney (sean-k-mooney) wrote :

In the nova API the console output request just ended up as a timeout:
oslo_messaging.exceptions.MessagingTimeout

Melanie, by the way, https://github.com/openstack/oslo.messaging/commit/b7e9faf6590086b79f9696183b5c296ecc12b7b6 only enabled it for direct sends,
not for topic sends, so I think there is still a gap here: we are not getting
MessageUndeliverable, we are getting timeouts.

I think we can fix this in 2 ways:
a.) delete the queue and just let the compute agent fix itself via the reconnect.
b.) fix the topic binding.

I'm kind of tempted to say we do a.), because that would fix it when it is caused by other issues too, not just the topic binding.
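
A rough sketch of option a.) via the management API (same assumptions about host, port and credentials as the reproduction sketch above):

    import requests

    RABBIT = 'http://172.20.4.244:15672'
    AUTH = ('guest', 'guest')
    VHOST = '%2F'
    QUEUE = 'compute.numa-1'

    # Delete the broken topic queue; the compute's oslo.messaging consumer gets
    # a Basic.cancel, reconnects, and re-declares the queue with the correct
    # binding (the behavior shown in the log a few comments up).
    requests.delete(f'{RABBIT}/api/queues/{VHOST}/{QUEUE}', auth=AUTH).raise_for_status()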

Revision history for this message
Erik Olof Gunnar Andersson (eandersson) wrote :

I agree. a.) sounds reasonable.

BTW, it looks like this is one of the main causes of these queues ending up in a bad state (at least for us):
https://bugs.launchpad.net/oslo.messaging/+bug/1789177

Revision history for this message
Ben Nemec (bnemec) wrote :

That sounds fine to me also, although I'm not a messaging expert. I'll try to remember to ping kgiusti tomorrow about this.

Revision history for this message
Ken Giusti (kgiusti) wrote :

Just to make sure my understanding is correct:

The RPC call is failing because the binding between the destination exchange and queue no longer exists. This causes the RPC call to "black hole" the published request.

Yeah, if that is the case then option a.) sounds like the most feasible approach.

I tried simulating this using a test RPC client and server: after the server establishes the exchange/binding/queue, I used rabbitmqadmin to delete just the binding. This results in all following RPC calls timing out.

I then modified my RPC client to pass in that transport option:

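        # As I understand it, at_least_once=True turns on the 'mandatory' flag
        # on publish, so an unroutable request raises MessageUndeliverable
        # instead of being silently dropped: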
        _topts = messaging.TransportOptions(at_least_once=True)
        client = messaging.RPCClient(self.transport,
                                     self.target,
                                     timeout=conf.timeout,
                                     version_cap=conf.target_version,
                                     transport_options=_topts).prepare()

This change will cause the RPC call to immediately fail with a MessageUndeliverable exception.

However this doesn't fix the binding. When the binding is deleted the RPC _server_ gets no notification that anything is wrong, probably because the server simply subscribes to a queue after setting up the binding and as far as the server is concerned that queue is doing just fine.

Since there's no event occurring on the _server_ side I'm not sure if oslo.messaging can detect this particular problem and recover automagically.

Revision history for this message
Fabian Zimmermann (dev-faz) wrote :

Just my 3 cents.

I was able to "reproduce" this problem (sometimes) in our dev env. The binding is not lost, it is just no longer working.

There are other issues with rabbitmq, replication and strange behavior under high load.

For example, there is a small possibility for a mirrored queue to not complete replication if a node fails. This may lead to durable queues no longer being reachable, if the master node was the one that failed.

Using non-durable queues and replication, on the other hand, may trigger the binding issue above more often.

For both problems, deleting the queue fixes the issue: broken bindings get cleared and unreachable queues are recreated.
