Heartbeats stop when time is changed

Bug #1510234 reported by Diana Clarke
16
This bug affects 2 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
Undecided
Roman Podoliaka
Pike
Fix Committed
Undecided
John Smith
masakari
Fix Released
Undecided
SamP
oslo.service
Fix Released
Undecided
Unassigned

Bug Description

Heartbeats stop working when you mess with the system time. If a monotonic clock were used, they would continue to work when the system time was changed.

Steps to reproduce:

1. List the nova services ('nova-manage service list'). Note that the 'State' for each services is a happy face ':-)'.

2. Move the time ahead (for example 2 hours in the future), and then list the nova services again. Note that heartbeats continue to work and use the future time (see 'Updated_At').

3. Revert back to the actual time, and list the nova services again. Note that all heartbeats stop, and have a 'State' of 'XXX'.

4. The heartbeats will start again in 2 hours when the actual time catches up to the future time, or if you restart the services.

5. You'll see a log message like the following when the heartbeats stop:

2015-10-26 17:14:10.538 DEBUG nova.servicegroup.drivers.db [req-c41a2ad7-e5a5-4914-bdc8-6c1ca8b224c6 None None] Seems service is down. Last heartbeat was 2015-10-26 17:20:20. Elapsed time is -369.461679 from (pid=13994) is_up /opt/stack/nova/nova/servicegroup/drivers/db.py:80

Here's example output demonstrating the issue:

    http://paste.openstack.org/show/477404/

See bug #1450438 for more context:

    https://bugs.launchpad.net/oslo.service/+bug/1450438

Long story short: looping call is using the built-in time rather than a monotonic clock for sleeps.

    https://github.com/openstack/oslo.service/blob/3d79348dae4d36bcaf4e525153abf74ad4bd182a/oslo_service/loopingcall.py#L122

Oslo Service: version 0.11
Nova: master (commit 2c3f9c339cae24576fefb66a91995d6612bb4ab2)

Tags: oslo
description: updated
description: updated
Matt Riedemann (mriedem)
tags: added: oslo
description: updated
Matt Riedemann (mriedem)
Changed in oslo.service:
status: New → Confirmed
no longer affects: nova
Revision history for this message
Diana Clarke (diana-clarke) wrote :

Someone suggested that I try the following approach.

diff --git a/nova/service.py b/nova/service.py
index a09187d..396bf30 100644
--- a/nova/service.py
+++ b/nova/service.py
@@ -21,6 +21,9 @@ import os
 import random
 import sys

+import eventlet
+import monotonic
+
 from oslo_concurrency import processutils
 from oslo_config import cfg
 from oslo_log import log as logging
@@ -139,6 +142,10 @@ class Service(service.Service):
                  periodic_interval_max=None, db_allowed=True,
                  *args, **kwargs):
         super(Service, self).__init__()
+
+ hub = eventlet.hubs.get_hub()
+ hub.clock = monotonic.monotonic
+
         self.host = host
         self.binary = binary
         self.topic = topic

After applying the above patch, all services except nova-conductor use monotonic for the clock in loopingcall. The nova-conductor service isn't working because it continues to use the bulit-in time, and sleeps in loopingcall until the actual time catches up to the future time. I'm not sure why the nova-conductor service continues to use the built-in time after the patch. Does anyone else know?

    http://paste.openstack.org/show/477406/

Thoughts on this approach or other approaches?

Thanks!

Revision history for this message
Sean Dague (sdague) wrote :

I'm not sure why nova got added back in. This seems pretty isolated to oslo.service bug.

no longer affects: nova
Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

Looks like Diana maintained lots of notes here:
https://github.com/dianaclarke/openstack-notes/wiki/time-warp-bug

Revision history for this message
Diana Clarke (diana-clarke) wrote :

Please don't look too hard at my OpenStack notes on GitHub -- I'm new to OpenStack & there are probably embarrassing bits there :)

I did re-summarize the relevant parts and include them in the above bug description though.

PS. I'd love to work with someone on this if you have any new ideas. I got stuck & moved on after creating this bug to capture what I already tried.

Cheers!

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

No worries Diana, it had a lot of clues. Here's my effort:
https://review.openstack.org/#/c/286838/

Please see if this works for you

Revision history for this message
Diana Clarke (diana-clarke) wrote :

This change was merged, so I'm marking this as released. Thanks!

    https://review.openstack.org/#/c/286838/

Changed in oslo.service:
status: Confirmed → Fix Released
Revision history for this message
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/oslo.service 1.9.0

This issue was fixed in the openstack/oslo.service 1.9.0 release.

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :
Changed in nova:
assignee: nobody → Roman Podoliaka (rpodolyaka)
status: New → In Progress
Changed in nova:
assignee: Roman Podoliaka (rpodolyaka) → Stephen Finucane (stephenfinucane)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/434327
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=a7505ee648b22f5827117bcc1192b5654c29cf9c
Submitter: Jenkins
Branch: master

commit a7505ee648b22f5827117bcc1192b5654c29cf9c
Author: Roman Podoliaka <email address hidden>
Date: Wed Feb 15 16:47:42 2017 +0200

    Make eventlet hub use a monotonic clock

    If system time is adjusted first forward and then backward while a
    nova service is running (e.g. nova-compute), then there is a high
    probability, that periodic tasks will stop for the duration of time
    the system clock was adjusted backward.

    This was supposed to be fixed by the following patch to oslo.service
    https://review.openstack.org/#/c/286838/ , but the order of imports
    in unit tests and production code is different, so nova services
    end up starting with the default eventlet hub, that does not use a
    monotonic clock and, thus, is affected by changes of system time.

    Testing this is problematic, as it's a subject of imports order and
    is not reproduced in functional or unit tests (oslo_service is always
    imported earlier than eventlet hub is initialized, so it just does
    "the right thing"). The alternative is to make an assertion when
    services start.

    Closes-Bug: #1510234

    Change-Id: I110cf31ad2a0c74a0cf30ec08bd94d3a56727b39

Changed in nova:
status: In Progress → Fix Released
Changed in masakari:
assignee: nobody → Dinesh Bhor (dinesh-bhor)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to masakari (master)

Fix proposed to branch: master
Review: https://review.openstack.org/503939

Changed in masakari:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 17.0.0.0b1

This issue was fixed in the openstack/nova 17.0.0.0b1 development milestone.

Changed in masakari:
assignee: Dinesh Bhor (dinesh-bhor) → SamP (sampath-priyankara)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to masakari (master)

Reviewed: https://review.openstack.org/503939
Committed: https://git.openstack.org/cgit/openstack/masakari/commit/?id=2c4574bfdcf9bd08efcb3c82becb26787a635338
Submitter: Zuul
Branch: master

commit 2c4574bfdcf9bd08efcb3c82becb26787a635338
Author: Dinesh Bhor <email address hidden>
Date: Wed Sep 13 16:12:33 2017 +0530

    Make eventlet hub use a monotonic clock

    If system time is adjusted first forward and then backward while a
    masakari-engine service is running, then the periodic tasks stops
    for the duration of time the system clock was adjusted backward.

    This was supposed to be fixed by the following patch to oslo.service
    https://review.openstack.org/#/c/286838/ , but the order of imports
    in unit tests and production code is different, so masakari services
    end up starting with the default eventlet hub, that does not use a
    monotonic clock and, thus, is affected by changes of system time.

    Testing the change done in the patch is problematic, as it's a
    subject of imports order and is not reproduced in functional or
    unit tests (oslo_service is always imported earlier than eventlet
    hub is initialized, so it just does "the right thing").
    The alternative is to make an assertion when services start.

    Co-Authored-By: Roman Podoliaka <email address hidden>
    Closes-Bug: #1510234
    Change-Id: I9d917b3151d9cdf7340a173b5baf98def63c76cd

Changed in masakari:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/537811

Matt Riedemann (mriedem)
Changed in nova:
assignee: Stephen Finucane (stephenfinucane) → Roman Podoliaka (rpodolyaka)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/pike)

Reviewed: https://review.openstack.org/537811
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=de7ae358b7b61d6d82e5fe17d0a790ff027cdc53
Submitter: Zuul
Branch: stable/pike

commit de7ae358b7b61d6d82e5fe17d0a790ff027cdc53
Author: Roman Podoliaka <email address hidden>
Date: Wed Feb 15 16:47:42 2017 +0200

    Make eventlet hub use a monotonic clock

    If system time is adjusted first forward and then backward while a
    nova service is running (e.g. nova-compute), then there is a high
    probability, that periodic tasks will stop for the duration of time
    the system clock was adjusted backward.

    This was supposed to be fixed by the following patch to oslo.service
    https://review.openstack.org/#/c/286838/ , but the order of imports
    in unit tests and production code is different, so nova services
    end up starting with the default eventlet hub, that does not use a
    monotonic clock and, thus, is affected by changes of system time.

    Testing this is problematic, as it's a subject of imports order and
    is not reproduced in functional or unit tests (oslo_service is always
    imported earlier than eventlet hub is initialized, so it just does
    "the right thing"). The alternative is to make an assertion when
    services start.

    Closes-Bug: #1510234

    Change-Id: I110cf31ad2a0c74a0cf30ec08bd94d3a56727b39
    (cherry picked from commit a7505ee648b22f5827117bcc1192b5654c29cf9c)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 16.1.0

This issue was fixed in the openstack/nova 16.1.0 release.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.