boot-time race between /etc/network/if-up.d/ntpdate and "/etc/init.d/ntp start"

Bug #1125726 reported by Thomas Bushnell, BSG on 2013-02-15
This bug affects 2 people
Affects        Status         Importance  Assigned to
ntp (Ubuntu)   Fix Released   Medium      Unassigned
Precise        Fix Released   Medium      Cam Cope
Trusty         Fix Released   Medium      Cam Cope

Bug Description

[Impact]
* Hardware clocks are not stepped at boot, which can prevent NTP from ever
  syncing the clock.
  Incorrect clocks can cause serious issues in distributed systems.

* Upstream originally added a lock file to eliminate a race between the ntp
  service (which keeps the clock synchronized during normal operation) and
  ntpdate (which is used to step the clock by large intervals at boot time).
  That change had a flaw which introduced a deadlock. An Ubuntu patch was
  applied which broke the locking mechanism entirely, reintroducing the race
  condition.

* This change undoes the Ubuntu patch and fixes the deadlock by unlocking
  before attempting to start the ntp service.

[Test Case]

* There are two bugs: The race, and the deadlock. To reproduce the race more
  consistently:
  - add 'sleep 30' to '/etc/network/if-up.d/ntpdate' on the line preceding
    '/usr/sbin/ntpdate-debian -s $OPTS 2>/dev/null || :', and comment out
    'invoke-rc.d --quiet $service stop >/dev/null 2>&1 || true'. This will
    reproduce the case where the ntp service starts between the stop command
    and the ntpdate command.
    The result will be that the ntpdate command fails. There will be a
    message in syslog like:
      'ntpdate[17660]: the NTP socket is in use, exiting'
  - Reintroducing the lock brings back the deadlock issue. Both the ntpdate
    if-up.d script and the ntp init script check the lock file, but the
    ntpdate script attempted to start the ntp init script before unlocking
    the lock. Moving the unlock before the init script invocation fixes
    the deadlock. The original deadlock behavior is described here:
      https://bugs.launchpad.net/ubuntu/+source/ntp/+bug/246203
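
The ordering described above can be sketched in plain shell. This is an illustration under stated assumptions, not the packaged script: the lock is a temporary directory created with mkdir (standing in for the lock file the real scripts share), and stop_ntp, step_clock and start_ntp are hypothetical stand-ins that only record when they run.

```shell
#!/bin/sh
# Sketch only: every name below is hypothetical, not taken from the ntp package.
LOCKDIR="${TMPDIR:-/tmp}/ntp-lock-sketch.$$"   # hypothetical lock path
EVENTS=""

# Append one word to the event trace.
log_event() { EVENTS="${EVENTS:+$EVENTS }$1"; }

acquire_lock() {
    # mkdir is atomic, so it works as a simple mutual-exclusion primitive.
    until mkdir "$LOCKDIR" 2>/dev/null; do sleep 1; done
    log_event lock
}
release_lock() { rmdir "$LOCKDIR"; log_event unlock; }

stop_ntp()   { log_event stop-ntp; }   # stands in for stopping the ntp service
step_clock() { log_event ntpdate; }    # stands in for running ntpdate-debian
start_ntp() {
    # The init script takes the same lock, so it must find the lock free.
    acquire_lock
    log_event start-ntp
    release_lock
}

ifup_hook() {
    acquire_lock
    stop_ntp
    step_clock
    release_lock   # released BEFORE starting ntp, which avoids the deadlock
    start_ntp
}

ifup_hook
echo "$EVENTS"
```

Running it prints "lock stop-ntp ntpdate unlock lock start-ntp unlock". Moving the start_ntp call above release_lock makes the inner acquire_lock wait forever, which is the deadlock.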

[Regression Potential]

* Low. An out-of-sync clock could be stepped by a large amount at boot time,
  but only on machines with static IPs. The clock is only likely to be in
  this state if it was very skewed at boot time, which is also unlikely,
  since NTP usually keeps the software clock in sync during operation and
  the hardware clock is updated at shutdown.

In addition, /etc/dhcp/dhclient-exit-hooks.d/ntp is *also* getting in on the act, doing an ntp restart when it sees ntp service information from the DHCP server.

Robie Basak (racb) wrote :

Thanks Thomas.

It looks like this exists in the Ubuntu delta only, introduced in the fix for bug 246203. If we used the same lock file, I think we'd get a deadlock again.

Is there a specific operational problem here? The current arrangement does seem messy, but I'm reluctant to touch it if it works right now.

What do you think?

We're seeing a possibly related problem on first boot, with more painful consequences. Our install process does a puppet run in the late_command, and then a reboot, and then another puppet run happens on boot.

In that first boot to the installed system, we're seeing ntp start once and fail, reporting that it cannot bind UDP *.ntp, and then it doesn't run at all. This is different from the symptom I noticed above. I agree that bug 246203 was a real bug with the old way; that certainly explains why simply using the same lock file will not be good.

Robie Basak (racb) wrote :

Thanks Thomas.

I'm not sure what to do with this bug now though. Do you think you could distil your problem down to a failure case that applies generally? Or should we just leave this bug as a wishlist item to improve the ntp/ntpdate interaction? I guess the latter would need to be forwarded to Debian, as it would be awkward to diverge from them significantly on this.

I don't really understand why ntpdate can't just bind to an unprivileged port, leave ntpd alone, and thus not need any interaction. It seems to have a -u option to do this. Perhaps there's something I'm missing.

Changed in ntp (Ubuntu):
importance: Undecided → Low
status: New → Confirmed
Serge Hallyn (serge-hallyn) wrote :

Is there any danger of rc2.d restarting ntp before the if-up.d/ntpdate script is done starting ntpdate, causing ntpdate to fail?

Since we know that if-up.d/ntpdate will eventually start ntp, we could define a transient 'ntp-will-start' upstart job. Have if-up.d/ntpdate start ntp-will-start, stop ntp, do its thing, stop ntp-will-start, then start ntp. At that point, if there are two 'start ntp's it doesn't matter; one will succeed.

Paul Szabo (psz-maths) wrote :

See also
https://bugs.launchpad.net/ubuntu/+source/ntp/+bug/288905
where I said:
  The way I read "man ntpd" (on Debian wheezy), we could (should?) replace ntpdate by
  "ntpd -q"; and if we are going to run ntpd then ntpdate is unnecessary anyway.
  If we have (or are going to have) ntpd, then we should simply skip /etc/network/if-up.d/ntpdate;
  seeing how that depends on NTPSERVERS in /etc/ntp.conf or somesuch, I do not see that
  /etc/network/if-up.d/ntpdate is ever any use.

Cam Cope (ccope) wrote :

The fix for 246203 was wrong: http://bazaar.launchpad.net/~ubuntu-branches/ubuntu/karmic/ntp/karmic/revision/23

The issue is that the ntp init script is started inside the ntpdate ifup script BEFORE the lock file is unlocked. That init script then blocks because the lock is taken. The correct fix is to unlock before starting the init script.

The lock is supposed to be shared because both services cannot run at the same time; otherwise one of them will fail to start. ntpd usually wins right now because it gets started more times, but we have encountered a fairly easy-to-reproduce scenario where ntpdate fails to step the clock at boot, resulting in wildly inaccurate system times. I have attached a patch which fixes the problem here.

tags: added: patch
Cam Cope (ccope) wrote :

In case it wasn't clear, my patch is supposed to be for the debian/ntpdate.if-up file. Also, I think the priority of this bug should be higher; it was assigned 'low' when there was no clear problem caused by the race. Systems booting with uncorrectable clock skew can be a serious problem.

Changed in ntp (Ubuntu):
importance: Low → Medium
Iain Lane (laney) wrote :

Thanks Cam, I'm going to upload this to Xenial.

If you want this to be uploaded to a stable release, please provide the required information (QA information, regression potential, etc) from https://wiki.ubuntu.com/StableReleaseUpdates#Procedure

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package ntp - 1:4.2.6.p5+dfsg-3ubuntu9

---------------
ntp (1:4.2.6.p5+dfsg-3ubuntu9) xenial; urgency=medium

  [ Cam Cope ]
  * Use a single lockfile again - instead unlock the file before starting the
    init script. The lock should be shared - both services can't run at the
    same time. (LP: #1125726)

 -- Iain Lane <email address hidden> Mon, 07 Dec 2015 13:38:16 +0000

Changed in ntp (Ubuntu):
status: Confirmed → Fix Released
Cam Cope (ccope) on 2015-12-07
description: updated
Changed in ntp (Ubuntu Precise):
importance: Undecided → Medium
Changed in ntp (Ubuntu Trusty):
importance: Undecided → Medium
Changed in ntp (Ubuntu Precise):
status: New → Triaged
Changed in ntp (Ubuntu Trusty):
status: New → Triaged
Iain Lane (laney) wrote :

Uploaded both now, thanks again!

Changed in ntp (Ubuntu Precise):
status: Triaged → In Progress
Changed in ntp (Ubuntu Trusty):
status: Triaged → In Progress
Changed in ntp (Ubuntu Precise):
assignee: nobody → Cam Cope (ccope)
Changed in ntp (Ubuntu Trusty):
assignee: nobody → Cam Cope (ccope)

Hello Thomas, or anyone else affected,

Accepted ntp into trusty-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/ntp/1:4.2.6.p5+dfsg-3ubuntu2.14.04.7 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in ntp (Ubuntu Trusty):
status: In Progress → Fix Committed
tags: added: verification-needed
Changed in ntp (Ubuntu Precise):
status: In Progress → Fix Committed
Brian Murray (brian-murray) wrote :

Hello Thomas, or anyone else affected,

Accepted ntp into precise-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/ntp/1:4.2.6.p3+dfsg-1ubuntu3.8 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Cam Cope (ccope) wrote :

I can confirm I've been running this in production and have not seen any further issues.

tags: added: verification-done
removed: verification-needed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package ntp - 1:4.2.6.p5+dfsg-3ubuntu2.14.04.7

---------------
ntp (1:4.2.6.p5+dfsg-3ubuntu2.14.04.7) trusty; urgency=medium

  * Use a single lockfile again - instead unlock the file before starting the
    init script. The lock should be shared - both services can't run at the
    same time. (LP: #1125726)

 -- Cam Cope <email address hidden> Tue, 19 Jan 2016 10:22:39 +0000

Changed in ntp (Ubuntu Trusty):
status: Fix Committed → Fix Released

The verification of the Stable Release Update for ntp has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package ntp - 1:4.2.6.p3+dfsg-1ubuntu3.8

---------------
ntp (1:4.2.6.p3+dfsg-1ubuntu3.8) precise; urgency=medium

  * Use a single lockfile again - instead unlock the file before starting the
    init script. The lock should be shared - both services can't run at the
    same time. (LP: #1125726)

 -- Cam Cope <email address hidden> Tue, 19 Jan 2016 10:20:07 +0000

Changed in ntp (Ubuntu Precise):
status: Fix Committed → Fix Released

This fix is causing problems on Ubuntu 12.04 for me, on both KVM hosts and KVM guests. I see a message like

lockfile creation failed: exceeded maximum number of lock attempts

On my hosts, it delays the end of boot by several minutes, while some of my guests just never become network accessible.

For anyone else bitten by the same issue, I am currently using this workaround:
chmod -x /usr/sbin/ntpdate-debian

Cam Cope (ccope) wrote :

Nathan: How many interfaces or IPs are you bringing up? That error message makes it sound like there could be a lot of contention on the lock. Could you also get the output of `pstree | grep -B3 lockfile` while a VM is coming up? (You'll need to attach to a free virtual terminal using the kvm console.)

Upon reading more of the lockfile-create manpage, it appears that there's a non-configurable 5-minute timeout on stale locks. Setting the --use-pid option might free up the lock more quickly if the parent process has died for some reason.
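
A minimal sketch of how --use-pid might be applied, assuming lockfile-progs is installed. The lock path, the retry count of 3 and the with_lock helper are hypothetical choices for illustration, not what the package ships.

```shell
#!/bin/sh
# Sketch only: path, retry count and helper name are assumptions.
LOCKFILE="${TMPDIR:-/tmp}/ntpdate-sketch.$$"   # hypothetical lock path

with_lock() {
    # Without lockfile-progs, run the command unlocked.
    if ! command -v lockfile-create >/dev/null 2>&1; then
        "$@"
        return
    fi
    # --use-pid records the calling process's PID, so a lock whose owner has
    # died can be treated as stale; --retry bounds how long we wait.
    lockfile-create --use-pid --retry 3 "$LOCKFILE" || return 1
    "$@"
    status=$?
    lockfile-remove "$LOCKFILE"
    return "$status"
}

with_lock echo "clock step would run here"
```

Bounding the retries keeps a contended lock from holding up boot for minutes, at the cost of sometimes skipping the clock step.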

It's not clear to me how this could prevent networking from coming up, since the network has to be up for NTP to run, and the if-up.d script backgrounds the ntpdate locking+syncing script. sshd in 12.04 and 14.04 is started from an upstart script which does not depend on the NTP service. The NTP service itself is fairly early in the sysvinit order at S23, so there might be other init scripts blocked behind it.

Hi Cam,

On our hosts, 4 physical interfaces and then a bunch of bonds and bridges, taking the total up to 12 entries in /etc/network/interfaces. So contention certainly seems plausible?

My guests have actually gone back to working normally, so I have likely misattributed an unrelated problem that occurred at the same time.
