commit 5af632e9cab670cc25c2f627bb0d4c0a02258277
Author: Matt Riedemann <email address hidden>
Date: Wed Oct 3 18:57:45 2018 -0400
Use long_rpc_timeout in select_destinations RPC call

Conductor makes an RPC call to the scheduler to get hosts during
server create. In a multi-create request with a lot of servers
and the default rpc_response_timeout, that call can trigger a
MessagingTimeout. Due to the old retry_select_destinations
decorator, conductor will retry the select_destinations RPC call
up to max_attempts times, so thrice by default. This can clobber
the scheduler and placement while the initial scheduler worker is
still trying to process the beefy request and allocate resources
in placement.
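
The retry amplification described above can be sketched with a small
stand-alone simulation. This is not nova's code: MessagingTimeout,
retry_on_timeout, FakeScheduler and schedule are hypothetical stand-ins,
and the fake scheduler simply fails any call whose timeout is shorter
than the time the request needs.

```python
import functools


class MessagingTimeout(Exception):
    """Stand-in for the messaging timeout error (not the real class)."""


def retry_on_timeout(max_attempts):
    """Hypothetical decorator: re-send the call after every timeout."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except MessagingTimeout:
                    # Give up only once every attempt has timed out.
                    if attempt == max_attempts:
                        raise
        return wrapper
    return decorator


class FakeScheduler:
    """Counts how many times the same request reaches the scheduler."""

    def __init__(self, seconds_needed):
        self.seconds_needed = seconds_needed
        self.requests_received = 0

    def select_destinations(self, timeout):
        self.requests_received += 1
        # The caller stops waiting before the work is done.
        if self.seconds_needed > timeout:
            raise MessagingTimeout()
        return ["host1"]


def schedule(scheduler, timeout, max_attempts=3):
    """Call the fake scheduler through the retrying decorator."""
    @retry_on_timeout(max_attempts)
    def call():
        return scheduler.select_destinations(timeout)
    return call()
```

With a request that needs 200 seconds and a 60 second timeout, the
fake scheduler receives the same request three times before the caller
gives up. With a 300 second timeout it receives it once and replies.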

This has been recreated in a devstack test patch [1] and
shown to fail with 1000 instances in a single request with
the default rpc_response_timeout of 60 seconds. Changing the
rpc_response_timeout to 300 avoids the MessagingTimeout and
retry loop.

Since Rocky we have the long_rpc_timeout config option, which
defaults to 1800 seconds. The RPC client can thus be changed
to heartbeat the scheduler service during the RPC call every
$rpc_response_timeout seconds with a hard timeout of
$long_rpc_timeout. That change is made here.
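
The heartbeat-plus-hard-timeout behaviour can be illustrated with a
deliberately simplified model. The call_outcome function and its
server_alive flag are hypothetical and ignore real message timing. In
oslo.messaging this pairing is, to the best of my knowledge, expressed
through the call_monitor_timeout and timeout arguments when preparing
the RPC client, but the sketch below does not use that library.

```python
def call_outcome(work_seconds, rpc_response_timeout=60,
                 long_rpc_timeout=1800, server_alive=True):
    """Return (result, seconds_waited) for one simulated RPC call.

    Simplified model: a live server keeps heartbeating while it works,
    so the caller only gives up at the hard timeout. A dead server
    sends no heartbeat, so the caller gives up after one heartbeat
    interval instead of waiting for the hard timeout.
    """
    if not server_alive:
        # No heartbeat arrives within the monitoring interval.
        return ("timeout", rpc_response_timeout)
    if work_seconds <= long_rpc_timeout:
        # Heartbeats keep the call alive until the reply arrives.
        return ("reply", work_seconds)
    # Even a heartbeating server is cut off at the hard timeout.
    return ("timeout", long_rpc_timeout)
```

Under this model a 200 second request to a live scheduler gets its
reply without any retry, while a dead scheduler is still detected
after 60 seconds.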

As a result, the problematic retry_select_destinations
decorator is no longer necessary and is removed here. That
decorator was added in I2b891bf6d0a3d8f45fd98ca54a665ae78eab78b3
as a hack for scheduler high availability: a MessagingTimeout
was assumed to be the result of the scheduler service dying, so
retrying the request to hit another scheduler worker was
reasonable. That is clearly not sufficient in the large
multi-create case, and long_rpc_timeout, which heartbeats the
scheduler service, is a better fit for that HA scenario.
Reviewed: https://review.openstack.org/607735
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=5af632e9cab670cc25c2f627bb0d4c0a02258277
Submitter: Zuul
Branch: master
[1] https://review.openstack.org/507918/
Change-Id: I87d89967bbc5fbf59cf44d9a63eb6e9d477ac1f3
Closes-Bug: #1795992