service_update() should not set an RPC timeout longer than service.report_interval

Bug #1368989 reported by Chris Friesen
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
Medium
jichenjc
Juno
Fix Released
Medium
Alan Pevec

Bug Description

nova.servicegroup.drivers.db.DbDriver._report_state() is called every service.report_interval seconds from a timer in order to periodically report the service state. It calls self.conductor_api.service_update().

If this ends up calling nova.conductor.rpcapi.ConductorAPI.service_update(), it will do an RPC call() to nova-conductor.

If anything happens to the RPC server (failover, switchover, etc.) by default the RPC code will wait 60 seconds for a response (blocking the timer-based calling of _report_state() in the meantime). This is long enough to cause the status in the database to get old enough that other services consider this service to be "down".

Arguably, since we're going to call service_update( ) again in service.report_interval seconds there's no reason to wait the full 60 seconds. Instead, it would make sense to set the RPC timeout for the service_update() call to to something slightly less than service.report_interval seconds.

I've also submitted a related bug report (https://bugs.launchpad.net/bugs/1368917) to improve RPC loss of connection in general, but I expect that'll take a while to deal with while this particular case can be handled much more easily.

Sean Dague (sdague)
Changed in nova:
status: New → Confirmed
importance: Undecided → Medium
jichenjc (jichenjc)
Changed in nova:
assignee: nobody → jichenjc (jichenjc)
Revision history for this message
jichenjc (jichenjc) wrote :

how about the rpc_reponse_timeout is smaller than report_interval?
that means we can use rpc_response_timeout?

And, can we import rpc_response_timeout in nova code? should we add some code to return from oslo?

Revision history for this message
Chris Friesen (cbf123) wrote :

Yes, if rpc_reponse_timeout is smaller than report_interval then we could use the existing RPC timeout. (Though I don't think it would really hurt to make it longer. Past a certain point it doesn't make much difference, it's probably not going to get a reply.

If you did want to use it, could you just grab it via CONF.import_opt('rpc_response_timeout')?

Anyways, here's what I have in a private patch (against Havana, so it might not apply cleanly to current master).

diff --git a/nova/conductor/rpcapi.py b/nova/conductor/rpcapi.py
index e15aee1..ce2288c 100644
--- a/nova/conductor/rpcapi.py
+++ b/nova/conductor/rpcapi.py
@@ -391,8 +391,13 @@ class ConductorAPI(rpcclient.RpcProxy):
         return cctxt.call(context, 'compute_node_delete', node=node_p)

     def service_update(self, context, service, values):
+ # If we're calling this periodically, it makes no sense for the RPC
+ # timeout to be more than the service report interval.
+ timeout = service.get('report_interval')
+ if timeout and timeout > 5:
+ timeout -= 1
         service_p = jsonutils.to_primitive(service)
- cctxt = self.client.prepare(version='1.34')
+ cctxt = self.client.prepare(version='1.34', timeout=timeout)
         return cctxt.call(context, 'service_update',
                           service=service_p, values=values)

diff --git a/nova/servicegroup/drivers/db.py b/nova/servicegroup/drivers/db.py
index 31bd6f2..fb50d31 100644
--- a/nova/servicegroup/drivers/db.py
+++ b/nova/servicegroup/drivers/db.py
@@ -103,6 +103,9 @@ class DbDriver(api.ServiceGroupDriver):
             report_count = service.service_ref['report_count'] + 1
             state_catalog['report_count'] = report_count

+ # we'll make use of this in the RPC call to set the timeout
+ service.service_ref['report_interval'] = service.report_interval
+
             service.service_ref = self.conductor_api.service_update(ctxt,
                     service.service_ref, state_catalog)

Revision history for this message
jichenjc (jichenjc) wrote :

ok, thanks a lot for you let me know this

my thought is similar to your patch except the rpc_timeout_stuff
I will think about it more and submit a patch, thanks

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/124359

Changed in nova:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/124359
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=197bb467c0fa33700e5397c934fa10d8c16f1fbc
Submitter: Jenkins
Branch: master

commit 197bb467c0fa33700e5397c934fa10d8c16f1fbc
Author: jichenjc <email address hidden>
Date: Sun Sep 21 05:19:49 2014 +0800

    Use reasonable timeout for rpc service_update()

    nova.servicegroup.drivers.db.DbDriver._report_state() is called every
    service.report_interval seconds from a timer in order to periodically
    report the service state. It calls self.conductor_api.service_update().

    If this ends up calling nova.conductor.rpcapi.ConductorAPI.service_update(),
    it will do an RPC call() to nova-conductor.

    If anything happens which causes the RPC reply to be lost or
    never sent in the first place, by default the RPC code will
    wait 60 seconds for a response (blocking the timer-based
    calling of _report_state() in the meantime).
    This is long enough to cause the status in the database to get old
    enough that other services consider this service to be "down".

    if rpc_reponse_timeout is smaller than report_interval then we could
    use the existing RPC timeout, but wait longer won't hurt. So the patch
    didn't check it and only use report_interval.

    Change-Id: I88743183bce1a534812cfe6110c3fc2892058c53
    Closes-Bug: #1368989

Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
milestone: none → kilo-1
status: Fix Committed → Fix Released
Alan Pevec (apevec)
tags: added: juno-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/juno)

Fix proposed to branch: stable/juno
Review: https://review.openstack.org/154107

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/juno)

Reviewed: https://review.openstack.org/154107
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=7a30e354fe34510ce38b5fd8c79d6312b75cd367
Submitter: Jenkins
Branch: stable/juno

commit 7a30e354fe34510ce38b5fd8c79d6312b75cd367
Author: jichenjc <email address hidden>
Date: Sun Sep 21 05:19:49 2014 +0800

    Use reasonable timeout for rpc service_update()

    nova.servicegroup.drivers.db.DbDriver._report_state() is called every
    service.report_interval seconds from a timer in order to periodically
    report the service state. It calls self.conductor_api.service_update().

    If this ends up calling nova.conductor.rpcapi.ConductorAPI.service_update(),
    it will do an RPC call() to nova-conductor.

    If anything happens which causes the RPC reply to be lost or
    never sent in the first place, by default the RPC code will
    wait 60 seconds for a response (blocking the timer-based
    calling of _report_state() in the meantime).
    This is long enough to cause the status in the database to get old
    enough that other services consider this service to be "down".

    if rpc_reponse_timeout is smaller than report_interval then we could
    use the existing RPC timeout, but wait longer won't hurt. So the patch
    didn't check it and only use report_interval.

    Change-Id: I88743183bce1a534812cfe6110c3fc2892058c53
    Closes-Bug: #1368989
    (cherry picked from commit 197bb467c0fa33700e5397c934fa10d8c16f1fbc)

tags: added: in-stable-juno
Thierry Carrez (ttx)
Changed in nova:
milestone: kilo-1 → 2015.1.0
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.