[RFE] Use call_monitor_timeout of oslo.messaging RPCClient instead of custom backoff mechanism and hardcoded timeouts

Bug #2045058 reported by Ihar Hrachyshka
This bug affects 1 person
Affects: neutron
Status: In Progress
Importance: Wishlist
Assigned to: Ihar Hrachyshka
Milestone: (none listed)

Bug Description

Currently, neutron RPC clients repeat calls, time out, back off, and repeat again. This logic is implemented in the neutron-lib RPCClient itself, and it exists to handle requests that take a very long time.
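To make the current behaviour concrete, here is a simplified model of a backing-off client. It is a sketch, not the neutron-lib code: the names (BackoffClient, CallTimeout, slow_server), the doubling rule, and the ceiling of ten times the base timeout are all illustrative assumptions.

```python
import random


class CallTimeout(Exception):
    """Stand-in for a messaging timeout error (hypothetical name)."""


class BackoffClient:
    """Simplified model of a client that grows its timeout after each failure."""

    def __init__(self, send, base_timeout=60, max_factor=10,
                 sleep=lambda seconds: None):
        self._send = send                # callable(method, timeout) -> result
        self._base = base_timeout
        self._ceiling = base_timeout * max_factor
        self._sleep = sleep              # injected so the sketch runs instantly
        self._timeouts = {}              # per-method timeout learned so far

    def call(self, method, attempts=5):
        for _ in range(attempts):
            timeout = self._timeouts.get(method, self._base)
            try:
                return self._send(method, timeout)
            except CallTimeout:
                # Back off for a random interval, then double the timeout
                # for this method, capped at the ceiling.
                self._sleep(random.uniform(0, timeout))
                self._timeouts[method] = min(timeout * 2, self._ceiling)
        raise CallTimeout(method)


def slow_server(method, timeout):
    # Pretend the server needs 200 seconds; shorter waits time out.
    if timeout < 200:
        raise CallTimeout(method)
    return "done"


seen = []  # timeouts the client actually tried, in order


def recording_send(method, timeout):
    seen.append(timeout)
    return slow_server(method, timeout)


client = BackoffClient(recording_send)
result = client.call("sync_routers")
```

In this model the client gives up on two attempts before a third one waits long enough, and the server is never told that the earlier attempts were abandoned.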

Instead of failing, then bumping the timeout and hoping that it is enough this time (while leaving the server unaware), we could enable active heartbeating with the oslo.messaging call_monitor_timeout option.
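The idea behind active heartbeating can be sketched as two deadlines: a hard overall timeout, and a shorter "monitor" timeout that is pushed forward every time the server signals that it is still working. The function below is a toy model that replays a list of timestamped events; wait_for_reply and the event format are hypothetical, and this is not how oslo.messaging implements the option internally.

```python
def wait_for_reply(events, timeout, call_monitor_timeout):
    """Replay (time, kind) events and report how the call ends.

    kind is "heartbeat" or "reply". `timeout` is the hard overall deadline;
    `call_monitor_timeout` is the longest allowed silence from the server.
    """
    last_seen = 0.0
    for when, kind in sorted(events):
        soft_deadline = last_seen + call_monitor_timeout
        if when > min(soft_deadline, timeout):
            break  # the event arrived after the call had already failed
        if kind == "reply":
            return ("reply", when)
        last_seen = when  # heartbeat: server is alive, extend the soft deadline
    # No reply arrived in time: fail at whichever deadline came first.
    return ("timeout", min(last_seen + call_monitor_timeout, timeout))


# A slow but healthy server: heartbeats every 30s, reply after 200s.
healthy = [(t, "heartbeat") for t in range(30, 200, 30)] + [(200, "reply")]
slow_ok = wait_for_reply(healthy, timeout=600, call_monitor_timeout=60)

# A dead server: nothing ever arrives, so the call fails at the monitor timeout.
dead = wait_for_reply([], timeout=600, call_monitor_timeout=60)

# A server that keeps heartbeating but never replies hits the hard timeout.
stuck = wait_for_reply([(t, "heartbeat") for t in range(30, 900, 30)],
                       timeout=600, call_monitor_timeout=60)
```

With this scheme a long-running request can complete on the first attempt, while an unresponsive server is still detected quickly.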

See how nova did this for its clients: https://opendev.org/openstack/nova/commit/fe26a52024416ed2d37c2d5027da4b23231dc515

I believe this should replace backoff logic in neutron-lib.

Changed in neutron:
status: New → Triaged
importance: Undecided → Wishlist
tags: added: rfe-triaged
tags: added: rfe-confirmed
removed: rfe-triaged
Brian Haley (brian-haley) wrote :

I know we can find the review from the commit ID, but here is a direct link. It did have a couple of small follow-ons based on the topic.

https://review.opendev.org/c/openstack/nova/+/566696

Changed in neutron:
assignee: nobody → Ihar Hrachyshka (ihar-hrachyshka)
Ihar Hrachyshka (ihar-hrachyshka) wrote :

This RFE was discussed during drivers meeting, and the suggestion was to not require a spec for this change, instead:

"let's add some more details to the RFE like how it works for nova, how to handle shorter than 60sec calls etc, and push PoC"

Ihar Hrachyshka (ihar-hrachyshka) wrote :

The plan is:

- engage the call_monitor_timeout option without touching the RPC client backoff mechanism. (This way, the backoff serves as a failsafe if the heartbeat-based timeout misbehaves for some reason.)
- monitor the behavior of the heartbeat-based timeout mechanism over several cycles.
- eventually, consider removing the backoff mechanism from neutron-lib.

Nova enabled active heartbeating for RPC calls whose timeout is bumped above the default 60 seconds. This appears to be a historical decision, made, to quote, to "keep the failure timing characteristics that our code likely expects (from history)". I will check with Dan Smith, who wrote this (and the patch from ~2018 that integrated the mechanism into nova), to see whether there is a good reason to follow this example or whether we can proactively enable it for all calls. For now, I plan to apply it unconditionally, unless there is a good scaling- or stability-related reason not to.

Changed in neutron:
status: Triaged → In Progress