Comment 2 for bug 1854992

Matt Riedemann (mriedem) wrote :

Are there errors in the rabbitmq logs?

This is really a needle-in-a-haystack kind of bug unless you can narrow down specifically where something is going wrong.

The compute service makes DB queries/updates over RPC to conductor, and those calls are synchronous. Is something failing or timing out there? I ask because I think at some point we've talked about making those object indirection API calls use the long_rpc_timeout.

Otherwise, which RPC calls are you suggesting be changed from cast to call? Something from conductor to the compute? I'm not sure how that would help. During a build, we have:

* api casts to conductor
* conductor calls scheduler for a list of hosts
* conductor casts to the first host and passes the alternates
* if that host fails, compute casts to the cell conductor with the remaining alternates
* cell conductor checks if there are any remaining alternates and, if so, casts to the next alternate compute. This continues until we get a build or all alternates are exhausted, at which point conductor sets the instance to ERROR status and raises MaxRetriesExceeded, which ends the process

Note that the conductor<>compute interaction there is per instance. In other words, the instances in a multi-create request are not sent as a batch to the same selected compute host, and any reschedules to alternate hosts are also per instance.
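
To make the flow above concrete, here is a rough sketch of the per-instance reschedule loop. This is not nova code: build_instance, build_many, schedule and try_build are hypothetical names made up for illustration, and the casts are collapsed into a plain synchronous loop, so it only shows the ordering and the per-instance handling of alternates.

```python
class MaxRetriesExceeded(Exception):
    """Stand-in for the error raised once every candidate host has failed."""


def build_instance(instance, hosts, try_build):
    """Walk one instance through its selected host, then its alternates.

    hosts: ordered list from a (simulated) scheduler; hosts[0] is the
        selected host and the rest are the alternates.
    try_build: hypothetical callable(host, instance) -> bool standing in
        for the build attempt on a compute host.
    """
    remaining = list(hosts)
    while remaining:
        host = remaining.pop(0)
        if try_build(host, instance):
            instance["status"] = "ACTIVE"
            instance["host"] = host
            return host
        # On failure the real flow has the compute cast back to the cell
        # conductor with the remaining alternates; here we simply loop.
    # All alternates exhausted: mark the instance ERROR and stop.
    instance["status"] = "ERROR"
    raise MaxRetriesExceeded(instance["name"])


def build_many(instances, schedule, try_build):
    """Handle a multi-create request one instance at a time.

    Each instance gets its own host list and its own reschedules, so one
    instance failing does not affect the others.
    """
    results = {}
    for instance in instances:
        try:
            results[instance["name"]] = build_instance(
                instance, schedule(instance), try_build)
        except MaxRetriesExceeded:
            results[instance["name"]] = None
    return results
```

In this sketch an instance whose selected host fails simply moves on to its next alternate, while another instance in the same request that runs out of hosts ends up in ERROR on its own.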

I'm going to mark this as incomplete, since I think this needs quite a bit more debugging/investigation on your end, reproducing it to see where something helpful could be done.