services hang when time is jumping forward and backward

Bug #1642103 reported by Eugene Nikanorov
This bug affects 1 person
Affects             Status     Importance  Assigned to      Milestone
Mirantis OpenStack  Won't Fix  High        MOS Nova
10.0.x              Won't Fix  High        MOS Nova
7.0.x               Won't Fix  High        MOS Maintenance
8.0.x               Won't Fix  High        MOS Maintenance
9.x                 Won't Fix  High        MOS Maintenance

Bug Description

MOS 9.0

Consider the following scenario:
The cluster is power-cycled, and some nodes have their system time reset due to hardware.
When the cluster is up, ntp starts synchronizing the time. Meanwhile, services try to connect to rabbitmq and wait for it to become available using time.sleep(), monkeypatched by eventlet.
For some reason, during the sync the time gets adjusted forward and backward by several years (that much because the default systime is 200x-01-01), which makes the monkeypatched time.sleep($interval) wait for an additional $interval of time.

The root cause is that eventlet uses a non-monotonic timer.
Related patch that fixes the issue for oslo: https://review.openstack.org/#/c/190372/
That approach should be extended to all services that utilize oslo_service.
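
To illustrate the root cause, here is a minimal sketch, not eventlet's actual code. It assumes a scheduler that stores an absolute wake-up time and compares it against a clock on each loop iteration. SteppableClock, schedule and remaining are hypothetical names introduced for this illustration, and time.monotonic() assumes Python 3.3 or later.

```python
import time

YEAR = 365 * 24 * 3600


class SteppableClock:
    """Hypothetical stand-in for time.time() that can be stepped like NTP would."""

    def __init__(self, start):
        self.now = start

    def __call__(self):
        return self.now

    def step(self, seconds):
        self.now += seconds


def schedule(clock, interval):
    """Return the absolute clock reading at which a sleeper should wake up."""
    return clock() + interval


def remaining(clock, wake_at):
    """Seconds the scheduler would still wait before waking the sleeper."""
    return max(0.0, wake_at - clock())


# Wall-clock case: a 5 second sleep is scheduled, then the clock steps back 16 years.
wall = SteppableClock(start=1.48e9)
wake_at = schedule(wall, 5.0)
wall.step(-16 * YEAR)
stuck_for = remaining(wall, wake_at)  # 5 seconds plus the size of the backward step

# Monotonic case: the reading never goes backward, so at most 5 seconds remain.
mono_wake_at = schedule(time.monotonic, 5.0)
mono_remaining = remaining(time.monotonic, mono_wake_at)
```

With the wall clock, the sleeper is effectively stuck for the size of the backward step, which matches the hang described above. With a monotonic clock the remaining wait cannot grow.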

description: updated
Roman Podoliaka (rpodolyaka) wrote :

>>> That approach should be extended to all services that utilize oslo_service

My understanding is that the oslo_service part of the problem is actually fixed, and now we need to make eventlet use a monotonic clock everywhere. The main thing we are interested in is probably the Hub class: https://github.com/eventlet/eventlet/blob/master/eventlet/hubs/hub.py#L116

But it's initialized implicitly when used in OpenStack:

http://paste.openstack.org/show/589381/

Thus, we'll probably need to patch eventlet directly. I took a quick look at the https://github.com/eventlet/eventlet/blob/master/eventlet/hubs/hub.py module, and it looks like using a monotonic clock there should be fine, as we are not really interested in the time value itself, only in the difference between two given points in time.

But there are more usages of time.time() in eventlet - http://paste.openstack.org/show/589384/ - we'll need to check those as well.

I don't think this is suitable for a stable release, though. This must be thoroughly tested in our current development branches first.

Kevin Benton (kevinbenton) wrote :

This doesn't actually have anything to do with the eventlet monkey patch of time.sleep IIUC. It's in the calculation of the $interval that we pass into it.

At least that's what I've gathered from the fix applied in that patch: they are fixing their calculations of time deltas by not using clock references. I don't see where they are replacing a call to time.sleep itself.

Kevin Benton (kevinbenton) wrote :

Ah, based on Roman's response, there could be both cases. eventlet itself could be busted by clock jumps and the services (e.g. Neutron) could be screwed up by clock jumps when calculating intervals based on the clock.
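
The service-side case (intervals calculated from the clock) can be sketched as follows. This is only an illustration under stated assumptions, not code from oslo_service or Neutron: wait_until_ready and broker_is_up are hypothetical names, and time.monotonic() assumes Python 3.3 or later.

```python
import time


def wait_until_ready(is_ready, timeout, poll_interval=0.01,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll is_ready() until it returns True or `timeout` seconds elapse.

    The deadline is derived from a monotonic clock, so a wall-clock step
    (for example an NTP correction) cannot stretch or shrink the wait.
    """
    deadline = clock() + timeout
    while True:
        if is_ready():
            return True
        left = deadline - clock()
        if left <= 0:
            return False
        # Never sleep past the deadline.
        sleep(min(poll_interval, left))


# Usage: a fake broker check that succeeds on the third attempt.
attempts = {"n": 0}


def broker_is_up():
    attempts["n"] += 1
    return attempts["n"] >= 3


connected = wait_until_ready(broker_is_up, timeout=1.0)
```

If the deadline were computed from time.time() instead, a backward clock step between two polls would push the deadline years into the future from the loop's point of view.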

Changed in mos:
importance: Undecided → High
Anton Matveev (amatveev)
tags: added: sla1
Dmitry Teselkin (teselkin-d) wrote :

Related to Python components, passing to mos-packaging.

Vitaly Sedelnik (vsedelnik) wrote :

Making all services resilient to clock jumps seems to be a pretty big effort. Let's try to fix it for 10.0 and see what could be backported to 9.x.

As for the 9.2 fix: Anton, you set it to customer-found. Are time jumps reproducing constantly in the customer environment? Can we just fix NTP to get things working?

Eugene Nikanorov (enikanorov) wrote :

Well, when I said "all services" I meant making a change in some library that is used by most of them.
One option is oslo_service; the work there has already been partially done.

tags: added: area-linux
Dmitry Mescheryakov (dmitrymex) wrote :

I have backported https://review.openstack.org/#/c/286838/ into 9.2 here: https://review.fuel-infra.org/#/c/28713/

https://review.openstack.org/#/c/190372/ is already there starting from 8.0.

But as Roman pointed out, there are a lot more places in eventlet where time is still calculated incorrectly, and we need to fix them before we can declare the bug fixed.

Vitaly Sedelnik (vsedelnik) wrote :

Won't Fix for 9.2, 8.0-updates, 7.0-updates, as this is not a bug but a feature request. We will consider implementing it in 10.

Dmitry Pyzhov (dpyzhov)
Changed in mos:
milestone: 9.2 → 10.0
Roman Podoliaka (rpodolyaka) wrote :
Changed in mos:
status: Confirmed → Won't Fix