Been hitting this a lot and poking at it in devstack today. It seems to be more than simply nodes becoming disassociated from instances on failure, and smells more like something deeper in the scheduler. I'm curious what change broke this, as it was working okay previously.
It's easy to reproduce in devstack: simply enroll multiple VMs (IRONIC_VM_COUNT) and set deploy_callback_timeout to something low. You'll notice that, after the first failure, the instance gets rescheduled to multiple nodes, and to multiple other nodes after the second failure. I've attached a client-side log showing the transitions.
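For anyone trying to reproduce this, a minimal local.conf sketch along those lines (the exact values are illustrative assumptions; any short timeout should trigger the failure path):

```ini
# local.conf fragment (values are illustrative)
[[local|localrc]]
# Enroll several VMs so there are spare nodes to reschedule onto
IRONIC_VM_COUNT=3

[[post-config|$IRONIC_CONF_FILE]]
[conductor]
# Short timeout so the deploy fails quickly and triggers a reschedule
deploy_callback_timeout = 60
```

With this in place, booting a single instance and waiting out the timeout should be enough to watch it bounce across nodes.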