NTP reload failure (unable to read library) on overlayfs

Bug #1701297 reported by Andres Rodriguez on 2017-06-29
This bug affects 20 people
Affects              Importance  Assigned to
cloud-init           Undecided   Unassigned
apparmor (Ubuntu)    Critical    Unassigned
cloud-init (Ubuntu)  Critical    Unassigned
linux (Ubuntu)       Critical    Unassigned

Bug Description

After the update [1] of cloud-init in Ubuntu (which landed in xenial-updates on 2017-06-27), NTP reloads are failing.

[1] https://launchpad.net/ubuntu/+source/cloud-init/0.7.9-153-g16a7302f-0ubuntu1~16.04.1

In MAAS scenarios, this is causing the machine to fail to deploy.

Related bugs:
 * bug 1645644: cloud-init ntp not using expected servers

Andres Rodriguez (andreserl) wrote :
Changed in cloud-init (Ubuntu):
importance: Undecided → Critical
Andres Rodriguez (andreserl) wrote :
summary: - NTP reload failure
+ NTP reload failure (causing deployment failures with MAAS)
description: updated

Seems this is related to:

    - cc_ntp: write template before installing and add service restart
      [Ryan Harper] (LP: #1645644)

David Britton (davidpbritton) wrote :

Please attach the journalctl -xe failure output

Changed in cloud-init:
status: New → Incomplete
Andres Rodriguez (andreserl) wrote :

Note that this bug was reported by Ashley, but I filed it. That said, I've attempted to reproduce this with the latest MAAS images containing cloud-init 0.7.9-153-g16a7302f-0ubuntu1~16.04.1 in MAAS 2.3 and have not come across this issue.

Ashley Lai (alai) wrote :

ubuntu@donphan:~$ sudo journalctl -u ntp --no-pager --since today
sudo: unable to resolve host donphan
-- Logs begin at Thu 2017-06-29 16:36:06 UTC, end at Thu 2017-06-29 16:44:47 UTC. --
Jun 29 16:36:36 donphan systemd[1]: Starting LSB: Start NTP daemon...
Jun 29 16:36:36 donphan ntp[2441]: * Starting NTP server ntpd
Jun 29 16:36:36 donphan ntp[2441]: ...done.
Jun 29 16:36:36 donphan systemd[1]: Started LSB: Start NTP daemon.
Jun 29 16:36:36 donphan ntpd[2453]: proto: precision = 0.057 usec (-24)
Jun 29 16:36:36 donphan ntpd[2453]: Listen and drop on 0 v6wildcard [::]:123
Jun 29 16:36:36 donphan ntpd[2453]: Listen and drop on 1 v4wildcard 0.0.0.0:123
Jun 29 16:36:36 donphan ntpd[2453]: Listen normally on 2 lo 127.0.0.1:123
Jun 29 16:36:36 donphan ntpd[2453]: Listen normally on 3 eno1 10.245.208.209:123
Jun 29 16:36:36 donphan ntpd[2453]: Listen normally on 4 lo [::1]:123
Jun 29 16:36:36 donphan ntpd[2453]: Listen normally on 5 eno1 [fe80::eeb1:d7ff:fe73:1da4%2]:123
Jun 29 16:36:36 donphan ntpd[2453]: Listening on routing socket on fd #22 for interface updates
Jun 29 16:36:37 donphan systemd[1]: Stopping LSB: Start NTP daemon...
Jun 29 16:36:37 donphan ntp[2519]: * Stopping NTP server ntpd
Jun 29 16:36:37 donphan ntpd[2453]: ntpd exiting on signal 15 (Terminated)
Jun 29 16:36:37 donphan ntp[2519]: ...done.
Jun 29 16:36:37 donphan systemd[1]: Stopped LSB: Start NTP daemon.
Jun 29 16:36:37 donphan systemd[1]: Starting LSB: Start NTP daemon...
Jun 29 16:36:37 donphan ntp[2532]: * Starting NTP server ntpd
Jun 29 16:36:37 donphan ntp[2532]: /usr/sbin/ntpd: error while loading shared libraries: libcap.so.2: cannot stat shared object: Permission denied
Jun 29 16:36:37 donphan ntp[2532]: ...fail!
Jun 29 16:36:37 donphan systemd[1]: ntp.service: Control process exited, code=exited status=127
Jun 29 16:36:37 donphan systemd[1]: Failed to start LSB: Start NTP daemon.
Jun 29 16:36:37 donphan systemd[1]: ntp.service: Unit entered failed state.
Jun 29 16:36:37 donphan systemd[1]: ntp.service: Failed with result 'exit-code'.
ubuntu@donphan:~$
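
For anyone scanning similar journals, here is a minimal Python sketch (not part of the original report) that picks out dynamic-loader failures like the libcap.so.2 line above. The function name and the assumption that each message follows a "unit[pid]: " prefix are illustrative only.

```python
import re

# Matches the dynamic loader failure format seen in the journal excerpt.
LOADER_ERROR = re.compile(
    r"^(?P<binary>\S+): error while loading shared libraries: "
    r"(?P<library>\S+): (?P<reason>.+)$"
)


def find_loader_errors(journal_text):
    """Return (binary, library, reason) for each loader failure in the text."""
    hits = []
    for line in journal_text.splitlines():
        # Assumption: the message follows a "unit[pid]: " prefix, so split
        # at the first "]: " and inspect only what comes after it.
        _, _, message = line.partition("]: ")
        match = LOADER_ERROR.match(message.strip())
        if match:
            hits.append((match["binary"], match["library"], match["reason"]))
    return hits
```

Feeding it the output of the journalctl command above should return one entry naming /usr/sbin/ntpd and libcap.so.2; lines without a loader error are ignored.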

David Britton (davidpbritton) wrote :

Hi Ashley --

I've tried to repro on the latest xenial images, and I don't see this happening (with NTP setup in my MAAS). Given the odd shared object permission denied message, I wonder if there is some corruption going on on the system?

Until we get a clear list of reproduction steps, or direct access to the problem on the system, I don't see a way to proceed debugging.

Please set back to new if you get more details.

Thanks!

Changed in cloud-init (Ubuntu):
status: New → Incomplete
Scott Moser (smoser) wrote :

Ashley,
Can we get 'dmesg' output and syslog also?

Thanks,

Andres Rodriguez (andreserl) wrote :

Ok, so I have been able to consistently reproduce this issue:

1. Using MAAS 2.1.5
2. Deployed a machine; it failed, with NTP failing to configure.
3. Added a global kernel param with apparmor=0. Deployed successfully.

[ 46.973551] audit: type=1400 audit(1498766510.281:15): apparmor="STATUS"
operation="profile_load" profile="unconfined" name="/usr/sbin/ntpd"
pid=2936 comm="apparmor_parser"
[ 48.051990] audit: type=1400 audit(1498766472.362:16): apparmor="DENIED"
operation="getattr" info="Failed name lookup - disconnected path" error=-13
profile="/usr/sbin/ntpd" name="overlay/etc/ld.so.cache" pid=3019
comm="ntpd" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[ 48.052025] audit: type=1400 audit(1498766472.362:17): apparmor="DENIED"
operation="getattr" info="Failed name lookup - disconnected path" error=-13
profile="/usr/sbin/ntpd" name="lib/x86_64-linux-gnu/libcap.so.2.24"
pid=3019 comm="ntpd" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
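
The audit records above can be split into fields for easier filtering. The following is a rough Python sketch, assuming each record has first been joined onto a single line (the paste above wraps them); the helper names are hypothetical and not part of any AppArmor tooling.

```python
import re

# key="quoted value" or key=bare_value pairs as they appear in the records.
FIELD = re.compile(r'(\w+)=("([^"]*)"|\S+)')


def parse_audit_record(record):
    """Parse one audit record (already on a single line) into a dict of strings."""
    fields = {}
    for key, raw, quoted in FIELD.findall(record):
        # Use the text inside the quotes when the value was quoted.
        fields[key] = quoted if raw.startswith('"') else raw
    return fields


def is_disconnected_path_denial(fields):
    """True for DENIED records whose info mentions a disconnected path."""
    return (
        fields.get("apparmor") == "DENIED"
        and "disconnected path" in fields.get("info", "")
    )
```

In the two DENIED records above, the parsed name values (overlay/etc/ld.so.cache and lib/x86_64-linux-gnu/libcap.so.2.24) have no leading slash, which matches the "disconnected path" message.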

On Thu, Jun 29, 2017 at 3:07 PM, Andres Rodriguez <email address hidden>
wrote:

> syslog http://paste.ubuntu.com/24983800/
> dmesg http://paste.ubuntu.com/24983801/
> journal http://pastebin.ubuntu.com/24983806/

Ryan Harper (raharper) wrote :

That looks related to the use of overlayfs in the ephemeral environment.


Changed in linux (Ubuntu):
importance: Undecided → Critical
Changed in apparmor (Ubuntu):
importance: Undecided → Critical

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in apparmor (Ubuntu):
status: New → Confirmed
Changed in linux (Ubuntu):
status: New → Confirmed
Andres Rodriguez (andreserl) wrote :

It seems that this only happens when using ga-16.04, not when using hwe-16.04.

Tyler Hicks (tyhicks) wrote :

AppArmor has difficulties mediating filesystem access when overlayfs is involved. That's a known issue but isn't one that is easily solved due to the internal design of overlayfs and its use of private vfsmounts. It also isn't something that we're planning to fix for the 17.10 cycle.

I thought that we recently investigated a similar issue to this and determined that MAAS wouldn't enable AppArmor when it is initially provisioning a machine. I can't remember the exact details and I'm not confident that was the final solution but maybe that rings some bells for the others that were involved.

Andres Rodriguez (andreserl) wrote :

@Tyler,

That's correct: MAAS 2.2.0+ sends apparmor=0 for the ephemeral environments.

That said, this affects anyone who is not using 2.2 (which, in fact, includes customers who are still on 2.1). Also, based on my testing, this doesn't happen with the hwe-16.04 kernel, but it does with the ga-16.04 kernel.

summary: - NTP reload failure (causing deployment failures with MAAS)
+ NTP reload failure (unable to read library) on overlayfs
John Johansen (jjohansen) wrote :

Andres,

Can you be more specific about the kernel version of the hwe kernel you are seeing this on?

Daniel Axtens (daxtens) wrote :

Hi Tyler,

Do you know what the changes between the ga-16.04 and hwe-16.04 kernel are that make apparmor+overlayfs work? I'm worried we might hit this problem elsewhere...

Regards,
Daniel

Andres Rodriguez (andreserl) wrote :

When using the following kernel (the default Xenial kernel, aka ga-16.04 in MAAS), we see this issue:

4.4.0-83-generic #106-Ubuntu SMP Mon Jun 26 17:54:43 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

When using the HWE kernel (aka hwe-16.04 in MAAS), we do NOT see this issue:

4.8.0-58-generic #63~16.04.1-Ubuntu SMP Mon Jun 26 18:08:51 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
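
As a rough aid for checking which of these two kernel series a machine is running, here is a small Python sketch. The table encodes only the two observations reported in this comment (4.4 shows the issue, 4.8 does not); it says nothing about other kernel series, and the function names are hypothetical.

```python
import re

# Observations from this bug report only; any other series is untested here.
OBSERVED = {(4, 4): True, (4, 8): False}


def kernel_series(release):
    """Extract (major, minor) from a release string like '4.4.0-83-generic'."""
    match = re.match(r"(\d+)\.(\d+)\.", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def observed_affected(release):
    """True/False if this report covers the series, otherwise None."""
    return OBSERVED.get(kernel_series(release))
```

On a running system the release string could come from platform.release() in the Python standard library, which reports the same value as uname -r.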

Tyler Hicks (tyhicks) wrote :

On 07/05/2017 08:14 PM, Daniel Axtens wrote:
> Hi Tyler,
>
> Do you know what the changes between the ga-16.04 and hwe-16.04 kernel
> are that make apparmor+overlayfs work?

No, we're not currently aware of any code changes that would cause the
behavioral change that is reported in the bug. Now that we have the
specific kernel version of the HWE kernel, John Johansen can look into
possible causes for the change.

Daniel Axtens (daxtens) wrote :

Tyler - thanks for that.

John - this is coming up in some internal support team escalations so I'm going to have a look at the kernel changes myself and will let you know if I find anything. I'd be keen to sync up if you have any leads.

Regards,
Daniel

John Johansen (jjohansen) wrote :

From an AppArmor point of view, those two kernels are almost identical; the 4.4 kernel picks up a couple of backported patches that just do some simple remapping and should not affect behavior.

There are, however, some external changes that could affect AppArmor mediation:
 * a binfmt_elf change (9f834ec18defc369d73ccf9e87a2790bfa05bf46), which could very well be related to the name="overlay/etc/ld.so.cache" failure
 * several changes in overlayfs that may result in permission changes

We can build some test kernels: a 4.4 kernel with the binfmt_elf commit cherry-picked and another one with the newer overlayfs.

Tyler Hicks (tyhicks) wrote :

John is going to build a test kernel, based on the ga-16.04 kernel, with the binfmt_elf commit cherry-picked from the hwe-16.04. That will let someone from the MAAS team attempt to reproduce the issue with the test kernel and, if the deployment succeeds, it'll tell us that the binfmt_elf commit is causing the change in behavior.

Tyler Hicks (tyhicks) wrote :

@Andres One thing that I'm struggling with is why this bug hasn't been seen before. IIUC, it should be present in the very first ga-16.04 kernel that Ubuntu 16.04 LTS was released with (in addition to earlier kernels while Xenial was a development release). Have MAAS 2.1.x and ga-16.04 kernels only just now been used together, or could there be some other change (in the kernel, MAAS, or maybe even something else) since the time that Ubuntu 16.04 LTS was released that we're not considering?

Would it be possible for someone on your team to test with the ga-16.04 kernel version 4.4.0-21.37 from the xenial-release pocket with both maas 2.0.0~beta3+bzr4941-0ubuntu1 from xenial-release and maas 2.1.5+bzr5596-0ubuntu1~16.04.1 from xenial-updates? I don't know how much effort that is but it would be helpful in understanding what changed in 16.04 to start triggering this bug.

Tyler Hicks (tyhicks) wrote :

To elaborate a bit more, the apparmor and overlayfs incompatibility has been a known kernel issue from before 16.04's release and, at this time, isn't something that is likely to be fixed in 16.04. I'd like to better understand if something changed in userspace that started tickling the incompatibility issue. Did MAAS change how it was using AppArmor in the recent SRU that took it to version 2.1.5?

Andres Rodriguez (andreserl) wrote :

@Tyler,

The reason this wasn't seen before is that, previously in Xenial, cloud-init did not restart 'ntp' with a new config file. Since a fixed cloud-init that does restart 'ntp' on overlay was recently SRU'd, the issue started to show up.

In other words, these issues started surfacing after a cloud-init bugfix.

John Johansen (jjohansen) wrote :

Well, that explains it. So we would have seen this issue from release except for the cloud-init bug.

Now we need to isolate the fix and backport it to the ga kernel.

Muthukumaran G (muthu883) wrote :

I am also getting the same issue with the ga kernel.
Where do we have to set the hwe kernel?

Do we have to set the hwe kernel on the MAAS node or on the nodes being deployed?

And are both the apparmor setting and the hwe kernel mandatory for deployment to work?

John Johansen (jjohansen) wrote :

There is a xenial test kernel at

http://people.canonical.com/~jj/lp1701297/

I have not had a chance to try it yet. I'll try to get to it in a few hours after some sleep.

Scott Moser (smoser) on 2017-07-11
description: updated
Scott Moser (smoser) wrote :

If you are affected by this bug, then you have the following options:
a.) upgrade to maas 2.2
 MAAS 2.2 sends 'apparmor=0' to the installation/commissioning kernel command
 line. 2.2 is in -proposed for 16.04, 16.10, 17.04 repositories and
 is already available in artful.

 Alternatively you can use the maas ppa (add-apt-repository ppa:maas/stable).
 Note, however, that updates to that repository are not managed by the same
 Stable Release Updates policy that is applied to the ubuntu release.

b.) cherry pick the commit to your installation.
 The MAAS commit that put 'a' into place is:
   https://git.launchpad.net/maas/commit/?id=df9a79b9dba9
 It is a one line change. You will need to restart the maas region
 controller after applying the change.

c.) add a global 'kernel_parameter' in maas 2.1.5 with 'apparmor=0'.
 **WARNING**: this will copy over the kernel parameter to the installed
 system, thus without further change, installed systems would run without
 apparmor.
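
Given the warning in option c, it may be worth checking a deployed system's kernel command line for a leftover 'apparmor=0'. A minimal Python sketch follows, assuming that the last apparmor= parameter on the command line is the one that takes effect (an assumption, not something stated in this report); the function names are hypothetical.

```python
def apparmor_disabled_on_cmdline(cmdline):
    """Return True if the effective apparmor= parameter in cmdline is 0."""
    value = None
    for param in cmdline.split():
        key, sep, val = param.partition("=")
        if key == "apparmor" and sep:
            # Assumption: a later occurrence overrides an earlier one.
            value = val
    return value == "0"


def read_cmdline(path="/proc/cmdline"):
    """Read the running kernel's command line (Linux only)."""
    with open(path) as handle:
        return handle.read()
```

On an installed machine, apparmor_disabled_on_cmdline(read_cmdline()) returning True would suggest the parameter was carried over from the MAAS global kernel parameters.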

Changed in cloud-init:
status: Incomplete → Won't Fix
Marzog (kaaalid) on 2017-08-08
Changed in linux (Ubuntu):
status: Confirmed → Fix Committed
Daniel Axtens (daxtens) wrote :

Hi Marzog,

What commit has been committed to Linux? I cannot find it.

Regards,
Daniel

An addition to workaround b.) from #32.

We changed the code in question and restarted the maas rackd service, which fixed the issue.

Patrick M. Womack (ipat8) wrote :

Hi all, sorry to necro a thread, but I'm having this issue on MAAS 2.3.1. Is anyone able to provide some insight into this issue?

Changed in apparmor (Ubuntu):
status: Confirmed → Invalid
Changed in linux (Ubuntu):
status: Fix Committed → Fix Released
Brad Figg (brad-figg) on 2019-07-24
tags: added: cscc