Problem seems to be around updating the existing plan during the deployment from the CLI. This error has only occurred in multinode jobs (not ovb). The first occurrences started on this patch: https://review.openstack.org/#/c/368760/
The error does not happen 100% of the time, but it can be seen on some earlier CI results of that patch.
We've tried several workarounds to address the problem:
Reducing the mistral workers (since ovb has fewer vCPUs than multinode and does not see this issue): https://review.openstack.org/370847
That failed the same way.
Also tried deleting the existing plan before starting the deployment: https://review.openstack.org/#/c/370857/
That also failed with the messaging timeout, but it exposed another issue: the default plan may not have finished being created before we start the overcloud deployment. When we start the deployment and then try to update the plan, we could be tripping over ourselves and causing this error.
Dougal has a patch to wait to make sure the default plan is created, which may be the true fix:
https://review.openstack.org/#/c/369247/
But that was not being tested appropriately due to:
https://bugs.launchpad.net/tripleo/+bug/1623891
where we were not testing patches with delorean due to unintentionally deleting the delorean db.
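For illustration only, the general idea of waiting for the default plan before starting the deployment could be sketched as a simple polling loop. This is not the code from the patch above; plan_exists is a hypothetical caller-supplied check, and the timeout and interval values are assumptions.

```python
import time


def wait_for_plan(plan_exists, timeout=60.0, interval=2.0,
                  clock=time.monotonic, sleep=time.sleep):
    """Poll plan_exists() until it returns True or the timeout expires.

    plan_exists is a hypothetical zero-argument callable supplied by the
    caller (for example, a check that the default plan is present). It is
    an assumption for this sketch, not a TripleO or Mistral API.
    Returns True if the plan showed up in time, False otherwise.
    """
    deadline = clock() + timeout
    while True:
        if plan_exists():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


if __name__ == "__main__":
    # Stand-in check that reports the plan as ready on the third poll.
    polls = {"count": 0}

    def fake_plan_exists():
        polls["count"] += 1
        return polls["count"] >= 3

    ready = wait_for_plan(fake_plan_exists, timeout=5.0, interval=0.1)
    print("plan ready:", ready)
```

A caller would only start the deployment, and the plan update that comes with it, once the function returns True, and would fail early with a clear error otherwise instead of running into the messaging timeout.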
As of now, we are attempting to land this revert:
https://review.openstack.org/#/c/370434/
However, given the other CI issue with not testing patches correctly, we can't land that revert until we temporarily make the multinode job nonvoting:
https://review.openstack.org/370922
Once that project-config patch lands, we plan to land these 3 patches:
https://review.openstack.org/#/c/370434/ (fixes this bug)
https://review.openstack.org/#/c/370250/ (separate issue needed to bring ovb back)
https://review.openstack.org/#/c/369792/ (fixes bug with CI not testing patches)
We will then re-enable the multinode job as voting.