mismatched file locking since 1:4.2.8p4+dfsg-3ubuntu1 causes race leaving ntp dead on reboot
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
ntp (Debian) | Fix Released | Unknown | |
ntp (Ubuntu) | Fix Released | Critical | Unassigned |
Xenial | Fix Released | High | Unassigned |
Zesty | Fix Released | High | Unassigned |
Artful | Fix Released | Critical | Unassigned |
Bug Description
[Impact]
* The locks used by the ntpdate ifup hook and by the ntp service start do
not match, therefore installation of ntpdate can hamstring the start of
ntp at boot.
* The change ports back what Debian added later and we merged in Zesty.
It does two things:
1. it makes the lock paths actually match
2. it drops the usage of lockfile-progs which never was a dependency
and uses flock directly.
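A minimal sketch of what "matching locks via flock" means in practice, not the packaged scripts themselves: both the hook and the service start go through the same lock file, so the second one waits for the first. The lock path, `with_lock`, and `demo` are hypothetical names used only for illustration.

```shell
#!/bin/sh
# Hypothetical shared lock path for this sketch (not the packaged path).
LOCKFILE="${LOCKFILE:-/tmp/demo-ntpdate.lock}"

# Run a command while holding an exclusive lock on file descriptor 9.
with_lock() {
    (
        flock -x 9            # block until the lock is ours
        "$@"
    ) 9>"$LOCKFILE"
}

demo() {
    # The "hook" grabs the lock first and holds it for a second.
    with_lock sh -c 'echo "hook: start"; sleep 1; echo "hook: end"' &
    sleep 0.2
    # The "service" blocks here until the hook releases the lock.
    with_lock echo "service: start"
    wait
}
```

Calling `demo` should print the service line only after the hook has finished, because both sides contend on the same file.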
[Test Case]
* Prep
- Taking a Xenial VM (to avoid all the time set rejects in a container
from cluttering the view)
- Installing ntp
- Check status of ntp
- Reboot the VM
- Check status of ntp
# Until now all should be good
* Break it
- install ntpdate
- reboot
- Check status of ntp
- It has (likely) failed for "blocked known address being busy"
- This is somewhat of a race; if you can't reproduce it, adding extra
network devices to your guest in libvirt increases the chance.
* Fix it
- install the fix from proposed (or the ppa in c#14)
- reboot
- ntp is now running correctly after reboot
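The steps above can be sketched as shell helpers, split at each reboot. The function names are hypothetical, and the "Fix it" step is left out because enabling proposed (or the PPA) depends on the local setup. Nothing runs until a function is called explicitly.

```shell
#!/bin/sh
# Sketch of the test case; function names are hypothetical.

# Show whether the ntp service is running or failed.
check_ntp() {
    systemctl status ntp --no-pager
}

# Prep: install ntp and confirm it runs.
prep() {
    sudo apt-get install -y ntp
    check_ntp
}

# Break it: adding ntpdate introduces the competing ifup hook.
break_it() {
    sudo apt-get install -y ntpdate
}

# After each step: sudo reboot, log back in, then run check_ntp.
```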
[Regression Potential]
* It was locking before as well, just on a lock that was never contended
and with the risk that the lockfile-progs calls were not available.
Due to the change, the init of ntp can now take longer (until the
ntpdate calls are out of the way).
* As a fallback in case locking goes crazy in unexpected ways, the
return code of the flock call (timeout 180s) is intentionally not
checked. That way ntp still tries to initialize in those cases, and if
it fails because an ntpdate is blocking the port it didn't "lose"
anything by being stalled.
Therefore I'd consider the actual regression potential to be rather
low and the change safe.
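A minimal sketch of that fallback, assuming a hypothetical lock path and helper name: the wait for the lock is bounded, and a timeout is deliberately ignored so the start is attempted anyway. The real scripts use a 180 second wait; the helper takes the wait as a parameter so it can be tried with a short one.

```shell
#!/bin/sh
# Hypothetical lock path for this sketch.
LOCKFILE="${LOCKFILE:-/tmp/demo-ntpdate-timeout.lock}"

# Wait up to $1 seconds for the lock, then run the remaining arguments
# whether or not the lock was obtained.
start_after_wait() {
    wait_secs="$1"
    shift
    (
        # Result deliberately ignored: on timeout we still try to start.
        flock -w "$wait_secs" 9 || true
        "$@"
    ) 9>"$LOCKFILE"
}
```

For example, `start_after_wait 180 echo "starting ntp"` would wait at most 180 seconds for a holder of the lock and then run the command in either case.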
[Other Info]
* This is kind of a bug-zombie, fixed in zesty but resurrected in Debian
(and Ubuntu by our merge) due to the addition of a native systemd
service. Now that Dev is finally (again) good it is time to tackle the
Xenial SRU.
---
ntpdate and ntp conflict on the NTP well-known-socket. If ntp and ntpdate 1:4.2.8p4+
When the ntp service is started by systemd, ntp fails to bind the NTP socket because ntpdate is running in the background. ntp and ntpdate are meant to avoid this conflict with a lock file, but the locking mechanism was changed in ntpdate.if-up (from lockfile to flock) and was not changed in ntp.init. Previously the file locking prevented ntp from trying to start while ntpdate was running. Not any more.
Having multiple interfaces makes the socket unavailable for much longer, because the 2 ntpdate processes get serialized by the lock, while the ntp service is looking for a different lock, so it just plows right in. Attempts by ntpdate.if-up to stop and start ntp seem to overlap, and when the final start is invoked, systemd seems to think ntp is already running, though it has failed.
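The mismatch can be illustrated with a small sketch, assuming two hypothetical lock paths and stand-in functions rather than the real scripts: two "hooks" serialize on one lock while the "service" waits on a different, uncontended lock and so starts in the middle of the hooks' work.

```shell
#!/bin/sh
# Hypothetical paths; the point is only that the two locks differ.
HOOK_LOCK="${HOOK_LOCK:-/tmp/demo-hook.lock}"
SERVICE_LOCK="${SERVICE_LOCK:-/tmp/demo-service.lock}"
LOG="${LOG:-/tmp/demo-mismatch.log}"

# Stand-in for one ntpdate run triggered by an interface coming up.
hook() {
    (
        flock -x 9
        echo "hook $1 start" >>"$LOG"
        sleep 1               # pretend to occupy the NTP socket
        echo "hook $1 end" >>"$LOG"
    ) 9>"$HOOK_LOCK"
}

# Stand-in for the service start, waiting on a lock nobody else takes.
service() {
    (
        flock -x 9
        echo "service start" >>"$LOG"
    ) 9>"$SERVICE_LOCK"
}

demo() {
    : >"$LOG"
    hook eth0 &
    hook eth1 &
    sleep 0.5
    service                   # returns at once: its lock is uncontended
    wait
    cat "$LOG"
}
```

Running `demo` should show the service line landing between a hook's start and end lines, with the second hook still queued behind the first.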
In 1:4.2.8p4+
debian/
Looks like corresponds to rev 371 of debian/
This change diverged locking between ntpdate.if-up and ntp.init. This was rectified in rev 451 of ntp.init, to use compatible locking, but that doesn't appear in the Ubuntu version.
System Information:
lsb_release -rd:
Description: Ubuntu 16.04.2 LTS
Release: 16.04
apt-cache policy ntpdate:
ntpdate:
Installed: 1:4.2.8p4+
Candidate: 1:4.2.8p4+
Version table:
*** 1:4.2.8p4+
100 /var/lib/
apt-cache policy ntp:
ntp:
Installed: 1:4.2.8p4+
Candidate: 1:4.2.8p4+
Version table:
*** 1:4.2.8p4+
100 /var/lib/
Changed in ntp (Debian):
status: Unknown → Fix Released
Changed in ntp (Ubuntu Zesty):
importance: Undecided → High
Thank you for taking the time to report this bug and helping to make Ubuntu better.
I've confirmed that the issue is as you describe in Xenial. Trusty pre-dates the change to locking in ntp.if-up, so is consistent. Zesty is also consistent in using flock in both places, as is Artful. So this bug affects only Xenial.
In order to understand the importance of this bug, please could you explain why you're using both ntp and ntpdate? ntpdate isn't installed by default on Xenial, and shouldn't be required in the normal case because ntp defaults to "-g". So you could work around this bug by just removing the ntpdate package. Is there a particular reason that this won't work in your case?