oslo.messaging

[RFE] allow to set hard RPC timeout

Bug #1672836 reported by Ihar Hrachyshka on 2017-03-14

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
oslo.messaging	Confirmed	Wishlist	Unassigned

Bug Description

In Neutron, on SIGTERM, we want the Open vSwitch agent to exit before the init system kills the process with SIGKILL. By default, systemd kills processes that have not exited 90 seconds after being asked to stop. The Open vSwitch agent may need to complete some work before calling sys.exit(), so instead of exiting immediately, it sets the timeout for all RPC clients to quitting_rpc_timeout (an option defined in Neutron, default = 10) so that subsequent RPC operations are limited by the new value. The problem is that an RPC call already in flight (.call) blocks the current green thread for up to the time set by the rpc_timeout option in oslo.messaging (default = 60), which slows down graceful agent exit and raises the risk of SIGKILL from systemd.
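To illustrate the problem, here is a minimal sketch that does not use oslo.messaging at all. FakeClient is a hypothetical stand-in for an RPC client whose server never replies, and the timings are scaled down (0.6 s plays the role of the 60 s default). It assumes the timeout is read once when the call starts, which is the behavior described above: lowering the client timeout afterwards does not shorten a call that is already waiting.

```python
import threading
import time


class FakeClient:
    """Hypothetical stand-in for an RPC client; not the oslo.messaging API."""

    def __init__(self, timeout):
        self.timeout = timeout
        self._reply = threading.Event()  # never set: the server never answers

    def call(self):
        timeout = self.timeout  # read once, when the call starts
        started = time.monotonic()
        self._reply.wait(timeout)  # blocks until a reply or the timeout
        return time.monotonic() - started


def demo():
    client = FakeClient(timeout=0.6)  # plays the role of rpc_timeout = 60
    result = {}
    worker = threading.Thread(
        target=lambda: result.update(elapsed=client.call()))
    worker.start()
    time.sleep(0.2)       # the "signal" arrives while the call is in flight
    client.timeout = 0.1  # plays the role of quitting_rpc_timeout = 10
    worker.join()
    return result["elapsed"]
```

Calling demo() returns roughly the original 0.6 s wait rather than the shortened one, because only calls started after the change see the new value.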

It would be great to have a public API in oslo.messaging for setting a hard RPC timeout that also applies to ongoing RPC communication, aborting it with a Timeout exception if an operation hits the set timeout.

Let me describe the intended behavior with an example.

1. agent starts with rpc_timeout = 60.
2. agent starts a .call at time T.
3. at time T + 20, the process receives the signal to exit. Signal handler calls TRANSPORT.set_hard_rpc_timeout_in(10).
4. the .call started in step 2 continues up to T + 30, then receives Timeout. Subsequent .calls immediately raise Timeout because the time allocated for the remaining RPC communication has run out.
5. if instead the .call from step 2 completes at T + 21 and the agent triggers a subsequent .call, it still has 9 seconds left until the hard timeout, so it executes. If it completes before those 9 seconds are gone, it gets a reply; if not, it gets Timeout.
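The timeline above can be sketched as a small deadline calculation. This is only an illustration of the requested semantics, not existing oslo.messaging code: HardDeadline, RPCTimeout and remaining_for_call are hypothetical names, and set_hard_rpc_timeout_in is borrowed from the proposal in step 3. The clock is injectable so the example can be followed with the exact times from the steps.

```python
import time


class RPCTimeout(Exception):
    """Stand-in for the Timeout exception mentioned in this report."""


class HardDeadline:
    """Hypothetical helper tracking a process-wide hard RPC deadline."""

    def __init__(self, default_timeout, clock=time.monotonic):
        self.default_timeout = default_timeout  # e.g. rpc_timeout = 60
        self._clock = clock
        self._deadline = None  # absolute time; None means no hard limit yet

    def set_hard_rpc_timeout_in(self, seconds):
        # Only records a timestamp, so it is cheap to call at shutdown.
        self._deadline = self._clock() + seconds

    def remaining_for_call(self, call_started_at):
        """Return how many seconds a call may still wait for its reply."""
        now = self._clock()
        # Budget left from the regular per-call timeout.
        remaining = call_started_at + self.default_timeout - now
        if self._deadline is not None:
            # The hard deadline caps ongoing and new calls alike.
            remaining = min(remaining, self._deadline - now)
        if remaining <= 0:
            raise RPCTimeout("RPC time budget exhausted")
        return remaining
```

With T = 0 and a default of 60: after set_hard_rpc_timeout_in(10) at T + 20, the call started at T has 10 seconds left (step 4); a new call started at T + 21 has 9 seconds (step 5); any call attempted at T + 30 or later raises immediately. A waiting call would need to re-check remaining_for_call periodically for the cap to take effect while it is blocked.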

The idea behind this proposal is to let consuming processes limit the time impact of ongoing and to-be-executed RPC calls without necessarily aborting them with sys.exit() if they won't take too long to process.

I guess consumers may try to call TRANSPORT.shutdown() these days, but I don't think it's safe to do from a signal handler, and it doesn't give ongoing processing a chance to complete without hitting Timeouts from the RPC layer.

Ben Nemec (bnemec) on 2018-04-23

Changed in oslo.messaging:
status:	New → Confirmed
importance:	Undecided → Wishlist
