mismatched file locking since 1:4.2.8p4+dfsg-3ubuntu1 causes race leaving ntp dead on reboot

Bug #1706818 reported by Josh Littlefield on 2017-07-27
18
This bug affects 2 people
Affects Status Importance Assigned to Milestone
ntp (Debian)
Fix Released
Unknown
ntp (Ubuntu)
Critical
Unassigned
Xenial
High
Unassigned
Zesty
High
Unassigned
Artful
Critical
Unassigned

Bug Description

[Impact]

 * The locks of ntpdate the ifup hook and the ntp service start do not
   match, therefore installation of ntpdate can harmstring the start of
   ntp at boot.

 * The change ports back what Debian added later and we merged in Zesty.
   It does two things:
   1. it makes the lock paths actually match
   2. it drops the usage of lockfile-progs which never was a dependency
      and uses flock directly.

[Test Case]

 * Prep
   - Taking a Xenial VM (to avoid all the time set rejects in a container
     from cluttering the view)
   - Installing ntp
   - Check status of ntp
   - Reboot the VM
   - Check status of ntp
 # Until now all should be good
 * Break it
   - install ntpdate
   - reboot
   - Check status of ntp
     - It (likely) is failed for "blocked known address being busy"
   - This is somewhat of a race, adding more extra network devices in
     libvirt to your guest increases the chance if you can't reproduce.
 * Fix it
   - install the fix from proposed (or the ppa in c#14)
   - reboot
   - ntp is now running correctly after reboot

[Regression Potential]

 * It was locking before as well, just on a lock never contended and
   potentially failing to have the lockfile-progs calls available.
   Due to the change the init now of ntp can take longer (until the
   ntpdate calls are out of the way)
 * For a fallback in case locking goes crazy in unexpected ways the
   timeout of the flock (180s) is intentionally not checked for bad return
   codes. That way in those cases ntp still tries to initialize and if it
   fails for an ntpdate blocking the port it didn't "loose" anything by
   being stalled.
   Therefor I'd consider that the actual regression potential rather
   low and safe.

[Other Info]

 * This is kind of a bug-zombie, fixed in zesty but resurrected in Debian
   (and Ubuntu by our merge) due to the addition of a native systemd
   service. Now that Dev is finally (again) good it is time to tackle the
   Xenial SRU.

---

ntpdate and ntp conflict on the NTP well-known-socket. If ntp and ntpdate 1:4.2.8p4+dfsg-3ubuntu5.5 are installed on Xenial, and there are 2 static interfaces configured, most often we find that ntpd is not running after a reboot.

When the ntp service is started by systemd, ntp fails to bind the NTP socket because ntpdate is running in the background. It's intended that ntp and ntpdate try to avoid this conflict with a lock file, but the locking mechanism was changed in ntpdate.if-up (from lockfile to flock), but it was not changed in ntp.init. Previously the file locking prevented ntp from trying to start when ntpdate was running. Not any more.

Having multiple interfaces causes a much longer period of the socket being unavailable, because the 2 ntpdate processes will get serialized by the lock, while the ntp service is looking for a different lock, so it just plows right in. Attempts by netdate.if-up to stop and start ntp seem to overlap and when the final start is invoked, systemd seems to thing ntp is already running, though it has failed.

In 1:4.2.8p4+dfsg-3ubuntu1 the following change was made:
  debian/ntpdate.if-up: Drop lockfile mechanism as upstream is using flock now.
Looks like corresponds to rev 371 of debian/ntpdate.if-up from upstream.

This change diverged locking between ntpdate.if-up and ntp.init. This was rectified in rev 451 of ntp.init, to use compatible locking, but that doesn't appear in the Ubuntu version.

System Information:

 lsb_release -rd:
   Description: Ubuntu 16.04.2 LTS
   Release: 16.04

 apt-cache policy ntpdate:

   ntpdate:
     Installed: 1:4.2.8p4+dfsg-3ubuntu5.5
     Candidate: 1:4.2.8p4+dfsg-3ubuntu5.5
     Version table:
    *** 1:4.2.8p4+dfsg-3ubuntu5.5 100
           100 /var/lib/dpkg/status

 apt-cache policy ntp:

   ntp:
     Installed: 1:4.2.8p4+dfsg-3ubuntu5.5
     Candidate: 1:4.2.8p4+dfsg-3ubuntu5.5
     Version table:
    *** 1:4.2.8p4+dfsg-3ubuntu5.5 100
           100 /var/lib/dpkg/status

Robie Basak (racb) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better.

I've confirmed that the issue is as you describe in Xenial. Trusty pre-dates the change to locking in ntp.if-up, so is consistent. Zesty is also consistent in using flock in both places, as is Artful. So this bug affects only Xenial.

In order to understand the importance of this bug, please could you explain why you're using both ntp and ntpdate? ntpdate isn't installed by default on Xenial, and shouldn't be required in the normal case because ntp defaults to "-g". So you could work around this bug by just removing the ntpdate package. Is there a particular reason that this won't work in your case?

Changed in ntp (Ubuntu):
status: New → Fix Released
Changed in ntp (Ubuntu Xenial):
status: New → Triaged
Robie Basak (racb) wrote :

Note for the server team: what I'd like to determine is if the user impact here justifies an SRU, and if it does, how we should prioritise it.

Changed in ntp (Debian):
status: Unknown → Fix Released
Paul Gear (paulgear) wrote :

https://bugs.launchpad.net/ubuntu/+source/ntp/+bug/1593907 was recently released to fix ntp/ntpdate interactions; if they're still broken, there seems little need to justify this given that precedent. (Full disclosure: I want ntp/ntpdate interactions solid so I can use ntpdate in the ntp charm. :-)

Robie Basak (racb) wrote :

Why does the ntp charm use ntpdate on Xenial, or need to use it?

@Robie - IIRC Paul wants to use it as "remote ntp check" which actually should be done via sntp but that binary was unavailable until artful and the fix doesn't fit an SRU (bug 1604010).

@Josh - just a day after your report the SRU completed that took out the old locking and the silly unconditional restart from the ifup hooks. I'd hope things should be much better for you now - could you please verify if that is true for your scenario as well?
If not we have to search where the remaining issue comes from as the new behavior should have ntp preferred to ntpdate and ntpdate being the one rejected as ntp already has the port used.
If there is a window e.g. due to the many interfaces that you described - that inverts that with ntpdate blocking ntp still that would have to be fixed.

Changed in ntp (Ubuntu Xenial):
status: Triaged → Incomplete
Paul Gear (paulgear) wrote :

@Robie The ntpdate command has a test mode that does not attempt to set the clock, which the sntp command lacks.

I've confirmed on multiple systems that the installation of ntpdate prevents ntp from starting on boot, which IMO is worthy of fixing.

Changed in ntp (Ubuntu Xenial):
status: Incomplete → New

- Taking a Xenial and a Artful VM
- Installing ntp
- Check status of ntp
  - running fine on both systems
- Reboot the VM
- Check status of ntp
  - still ntp service ok on both systems
- install ntpdate
- Check status of ntp
  - still ntp service ok on both systems
- reboot
- Check status of ntp
  - failed for blocked known address being busy on both
- reboot (to check reproducibility)
- Check status of ntp
  - failed for blocked known address being busy on both
- Adding two extra devices in libvirt and configuring it on the guest
- restart
- Check status of ntp
  - failed for blocked known address being busy on both (likely even at a higher "risk")

ntp init mechanims:
Xenial: /etc/init.d/ntp locks LOCKFILE=/var/lock/ntpdate
Artful: /usr/lib/ntp/ntp-systemd-wrapper locks nothing at all

This races against the following hook (in both releases):
/etc/network/if-up.d/ntpdate which locks /run/lock/ntpdate

Seems reproducible enough to me, actually much better reproducible than in the past.
I checked and our recent cleanup of the mess around debian/ntpdate.if-up fixed a lot of things.
Among other it removed an accidential restart of ntp which kind of hid this issue here (no regression-update, just an issue existing before now more likely to be hit).

But now with things no more that racy the fix is easy and much better testable.

Changed in ntp (Ubuntu Xenial):
status: New → Confirmed
Changed in ntp (Ubuntu):
status: Fix Released → Confirmed

Removing the last uncertainties, the related Debian bug being fixed fixed it in sysv init.
That now has:
 LOCKFILE=/run/lock/ntpdate

It has so e.g. in Artful.

And thereby we considered being fixed as well after merging that.
But versions >= that also have the systemd wrapper which doesn't lock at all.

Need to check Zesty if that might have no systemd wrapper yet but fixed sysv-init only.
...
Yep as I thought it really was fixed in Zesty by not (yet) having the systemd wrapper.

Changed in ntp (Ubuntu Zesty):
status: New → Fix Released
Changed in ntp (Ubuntu Artful):
importance: Undecided → Critical
Changed in ntp (Ubuntu Xenial):
importance: Undecided → High

TL;DR we need:
- a fix similar to what the sysv script got, but for the systemd wrapper
- the fix to the sysv script we picked up with out ntp merge in Zesty backported to Xenial

@Paul - thanks btw to bring that back up again!
Now that the reproducibility is improved by the last fixes this all makes way more sense.

Well while diving down on this I think the ntpdate hook can (on a running server) collect waiting flocks on every ifup/down - but the call in the hook already used -n which is --nonblock - so in case the lock isn't avail the ntpdate is skipped. Good no extra issue here.

To do this right for the systemd wrapper is harder.
We could go with ExecStartPre / ExecStopPost or code it into the systemd wrapper.
The pre/post solution is "more different" from what we had before in sysv, also I didn't find any other good examples f extra locks in systemd services to build upon - so stick with the change systemd wrapper.

With experimental changes I checked:
- sudo flock --timeout 5 /run/lock/ntpdate sleep 20 will hold the daemon start correctly until out of the way.
- I also see a correct error message now if that happens to time out.
- After the service started the lock is free it seems.
  But that isn't a problem the hook "can fail" without breaking anything - in case it is
  configured very differently it might even work.
  TL;DR - ntp would be protected from ntpdate, but vice versa that should not be needed.

Note - the old timeout that was used was 180 seconds, I keep that.

Setting up PPAs now ...

With ppa's at [1],[2] I tested and while it seemed 100% reproducible before it now seems to work on artful.
On Xenial I hit the error at least once still.

So there seems to be a race left that needs further analysis.
Actually the change changed more than just the file - I'll tear the Debian change apart and make sure it is fully backported for the next test.

[1]: https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/2946
[2]: https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/2947

Changed my fix and tests are good now.

Pushed MPs for review now.
@Paul - if you could test the ppas for your cases in advance as well, that would be great.

Review is good, pushed the artful fix and added an SRU Template.
Also reported the follow on bug to Debian as https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=874540 (I can't link two Debian bugs on the same paackage can I?)

description: updated

Well LP seems to have added it to wtaches, just adding another task with it doesn't work.
Waiting for Artful to migrate properly now.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package ntp - 1:4.2.8p10+dfsg-5ubuntu2

---------------
ntp (1:4.2.8p10+dfsg-5ubuntu2) artful; urgency=medium

  * d/ntp-systemd-wrapper protect systemd service startup from concurrent
    ntpdate processes the same way it was protected on sysv-init (LP: #1706818)

 -- Christian Ehrhardt <email address hidden> Tue, 05 Sep 2017 15:09:08 +0200

Changed in ntp (Ubuntu Artful):
status: Confirmed → Fix Released

Artful is done (the systemd version of it), Xenial now has the backport of the fix we had in zesty for SRU consideration in xenial-unapproved.

Changed in ntp (Ubuntu Xenial):
status: Confirmed → In Progress

Hello Josh, or anyone else affected,

Accepted ntp into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/ntp/1:4.2.8p4+dfsg-3ubuntu5.7 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed.Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-xenial to verification-done-xenial. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-xenial. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in ntp (Ubuntu Xenial):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-xenial
Changed in ntp (Ubuntu Zesty):
importance: Undecided → High
Paul Gear (paulgear) wrote :

@paelzer: Sorry for the delay - I was at a conference last week.

I've confirmed that the PPA version for xenial results in reliable restarts for ntpd. I tested 8 systems, 3 of which were updated to the new version from the PPA: the 3 upgraded VMs restarted reliably on each reboot, whilst 3 of the remaining VMs did not restart reliably (failing on 2/3 reboots), and the last 2 failed on 3/3 reboots.

Ok Paul, the ppa is code identical to the code in proposed.
I further did a check on x-proposed on top and it works as expected.
Setting verification done.

tags: added: verification-done verification-done-xenial
removed: verification-needed verification-needed-xenial
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package ntp - 1:4.2.8p4+dfsg-3ubuntu5.7

---------------
ntp (1:4.2.8p4+dfsg-3ubuntu5.7) xenial; urgency=medium

  * d/ntp.init: fix lock path to match the ntpdate ifup hook. Furthermore
    drop the usage of lockfile-progs calls and instead use flock directly.
    This is a backport of changes made in 1:4.2.8p7+dfsg-1 (LP: #1706818)

 -- Christian Ehrhardt <email address hidden> Tue, 05 Sep 2017 17:24:43 +0200

Changed in ntp (Ubuntu Xenial):
status: Fix Committed → Fix Released

The verification of the Stable Release Update for ntp has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.