sssd appears to crash AWS c5 and m5 instances, cause 100% CPU

Bug #1746806 reported by Paul Friel on 2018-02-01
This bug affects 15 people
Affects               Importance   Assigned to
cloud-images          High         Unassigned
linux (Ubuntu)        Critical     Kamal Mostafa
  Xenial              Undecided    Kamal Mostafa
linux-aws (Ubuntu)    Critical     Kamal Mostafa
  Xenial              Undecided    Kamal Mostafa

Bug Description

After upgrading to the Ubuntu EC2 AMI from 20180126 (specifically ami-79873901 in us-west-2) we have seen sssd hard locking c5 and m5 EC2 instances after starting the service and CPU goes to 100%.

We do not experience this issue with t2 or c4 instance types, and we do not see it on any instance type using Ubuntu Cloud images from 20180109 or before. I have verified that this is kernel related: I booted an image that we created from the 20180109 Ubuntu cloud image, which works fine on a c5. I then ran "apt update && apt install --only-upgrade linux-aws && systemctl disable sssd", rebooted the server, verified I was on the new kernel, and started sssd with "systemctl start sssd". The EC2 instance froze and CloudWatch CPU usage for that instance went to 100%.
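
The reproduction steps in the paragraph above could be scripted roughly as follows. This is a minimal sketch, not something posted by the reporter: the `run` helper and the `DRY_RUN` variable are hypothetical conveniences, and by default the script only prints the commands instead of running them. The reboot between the two phases has to be done by hand.

```shell
#!/bin/sh
# Sketch of the reproduction steps from the bug description.
# DRY_RUN and run() are illustrative helpers, not part of any Ubuntu tool.
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Always print the command; execute it only when DRY_RUN=0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

prepare() {
    # Phase 1, on a working 20180109-based image: upgrade the kernel
    # and keep sssd from starting at boot.
    run sudo apt update
    run sudo apt install --only-upgrade linux-aws
    run sudo systemctl disable sssd
    echo "# now reboot the instance manually"
}

trigger() {
    # Phase 2, after the reboot: confirm the running kernel, then start sssd.
    run uname -r
    run sudo systemctl start sssd
}

prepare
trigger
```

Running it with `DRY_RUN=0` on an affected c5 or m5 instance is expected to freeze the machine, so it is best tried on a disposable instance.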

I haven't been able to find much in the syslog, kern.log, journalctl logs, etc. The only thing I have found is that when this happens I tend to see "^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@" (runs of NUL bytes) in the syslog and sssd log files. I have attached several log files and the output of "apport-bug /usr/sbin/sssd". Let me know if you need anything else to help track this down.

Thanks,
Paul

Paul Friel (pfriel) wrote :
Dan Watkins (daniel-thewatkins) wrote :

Added the cloud-images project so we can more easily track this.

Eric Heydrick (eheydrick) wrote :

Seeing this as well. The system runs freeipa-client/sssd. If I remove sssd from the bootstrap the node does not lock up. I captured an strace showing the sssd startup is the last thing that happens before the 100% CPU and lock up.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in sssd (Ubuntu):
status: New → Confirmed
Paul Friel (pfriel) wrote :

I took a few minutes this afternoon and tried several different kernels, here is what I found:

works fine: 4.4.0.1047.49 (packaged with AWS Ubuntu Cloud Image from 20180109)
BROKEN: 4.4.0.1049.51 (installed with "apt update && apt install linux-aws=4.4.0.1049.51 linux-image-aws=4.4.0.1049.51 linux-headers-aws=4.4.0.1049.51")
BROKEN: 4.4.0.1050.52 (latest available as of 2018-02-01 using "apt update && apt install --only-upgrade linux-aws")
works fine: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.13.16/ (generic amd64 build)
works fine: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.14.16/ (generic amd64 build)

Robie Basak (racb) wrote :

If this problem comes and goes as you switch kernel, this points to a possible kernel regression. In that case perhaps sssd is just the trigger rather than the buggy package. So I'll add the kernel package to this bug.

Or perhaps it's a latent bug in sssd that only triggers with the newer kernel that is still following spec. I guess we won't be able to tell without further investigation.

In any case, letting the kernel team know seems appropriate.

tags: added: regression-update

David JM Emmett (davidjmemmett) wrote :

I'm experiencing this issue too with artful on c5.large. During the installation of sssd-common 1.15.3-2ubuntu1, the last line of output is "Warning failed to create cache: usr.sbin.sssd" before the instance becomes unresponsive. If I use t2.medium, I don't experience this issue.

Regions tested: us-west-2 and eu-west-1

I do believe that I've experienced this issue without installing sssd: sometimes the machine fails to boot on the initial launch of the AMI. It freezes at "kernel: iscsi: registered transport (iser)".

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux (Ubuntu):
status: New → Confirmed

Thanks. That's useful information.

On Fri, Feb 02, 2018 at 10:17:37AM -0000, David JM Emmett wrote:
> "Warning failed to create cache: usr.sbin.sssd" before the instance

That sounds AppArmor related perhaps?

John Johansen (jjohansen) wrote :

Maybe, but we would need more information to say for sure.

There have been no changes in apparmor between the reported working 20180109 and 20180126.

The warning
> "Warning failed to create cache: usr.sbin.sssd" before the instance

just means that apparmor was not able to cache the binary policy that it loaded. This is not unusual in some image configurations where the policy configuration hasn't been updated, e.g. if /etc/ is read-only and the apparmor cache is at its default location of /etc/apparmor.d/cache. This warning would come during package install or boot, before sssd is run.

We can easily test whether apparmor policy load is causing the issue by manually calling the apparmor_parser on policy separate from invoking the application/services associated with the fault.

  sudo apparmor_parser -rK /etc/apparmor.d/

we can also decouple apparmor policy enforcement from the application/services by disabling the profile on the instance
  sudo aa-disable /etc/apparmor.d/usr.sbin.sssd

or all profiles
  sudo systemctl disable apparmor.service

and we can disable apparmor from being used on the kernel at boot by adding the kernel parameter
  apparmor=0
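
The isolation steps above can be tried one at a time. A rough sketch, assuming a hypothetical `isolate` wrapper and a `DRY_RUN` variable that are not part of AppArmor; by default it only prints the command for the chosen step. The `apparmor=0` kernel parameter is left out because it has to be added to the boot loader configuration rather than run as a command.

```shell
#!/bin/sh
# Sketch: run one AppArmor isolation step at a time, as suggested above.
# isolate(), run() and DRY_RUN are hypothetical helpers; commands are
# only printed unless DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"

run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

isolate() {
    case "$1" in
        # Load policy by hand, without starting the service it confines.
        load)    run sudo apparmor_parser -rK /etc/apparmor.d/ ;;
        # Disable only the sssd profile.
        profile) run sudo aa-disable /etc/apparmor.d/usr.sbin.sssd ;;
        # Stop all profiles from being loaded at the next boot.
        service) run sudo systemctl disable apparmor.service ;;
        *)       echo "usage: isolate load|profile|service" >&2; return 2 ;;
    esac
}

isolate load
```

If the machine freezes on `isolate load` alone, that would point at policy load itself rather than at anything sssd does once it is running.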

Changed in linux (Ubuntu):
importance: Undecided → Critical
tags: added: kernel-key
Joseph Salisbury (jsalisbury) wrote :

I'd like to perform a kernel bisect to identify the commit that introduced this regression. First, can you test the 4.4.0-1048 kernel to narrow it down a bit more? That kernel can be downloaded from:
https://launchpad.net/~canonical-kernel-security-team/+archive/ubuntu/ppa/+build/14224482

Changed in linux-aws (Ubuntu):
status: New → Confirmed
importance: Undecided → Critical
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu):
assignee: nobody → Joseph Salisbury (jsalisbury)
Kamal Mostafa (kamalmostafa) wrote :

@Joseph, etc.

linux-aws 4.4.0-1048.57 does not appear to exhibit the problem. And I confirm pfriel's findings:

1047 works ok
1048 works ok
1049 broken
1050 broken
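
A results list like the one above can be reduced to the boundary that matters for a bisect. A small sketch, assuming a hypothetical `first_bad` helper that reads "version result" pairs in ascending version order:

```shell
#!/bin/sh
# Sketch: given "version result" pairs in ascending order, report the
# last good and first bad version. first_bad() is a hypothetical helper.
first_bad() {
    awk '
        $2 == "ok"     && !found { good = $1 }
        $2 == "broken" && !found { bad = $1; found = 1 }
        END { printf "last good: %s, first bad: %s\n", good, bad }
    '
}

first_bad <<'EOF'
1047 ok
1048 ok
1049 broken
1050 broken
EOF
# prints: last good: 1048, first bad: 1049
```

The regression therefore lies in whatever changed between the 1048 and 1049 linux-aws builds.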

Paul Friel (pfriel) wrote :

@racb and @jjohansen

I installed kernel 4.4.0-1050-aws, disabled sssd and apparmor on boot, and restarted on a c5, and it boots fine. It also boots fine with only sssd disabled on boot.

If I start sssd without apparmor running everything is fine. If I start apparmor first, then start sssd it freezes up.

Kamal Mostafa (kamalmostafa) wrote :

I was able to reproduce the problem with the -generic Xenial kernel (on a c5.large instance):

4.4.0-109.132 works ok
4.4.0-110.133 broken
4.4.0-112.135 still broken

Those results correspond as expected to the linux-aws kernel versions. The problem appears to have been introduced by something in Xenial -generic 4.4.0-110.133.

John Johansen (jjohansen) wrote :

There are no changes to apparmor in that range, but it does cover the KAISER changes. Since there were no apparmor changes, and KAISER changes how the kernel interacts with userspace memory, my guess is that something is triggering in copy_from_user when policy is loaded.

tags: added: pti
Joseph Salisbury (jsalisbury) wrote :

I started a kernel bisect between Ubuntu-4.4.0-109 and Ubuntu-4.4.0-110. The kernel bisect will require testing of about 5 test kernels.

I built the first test kernel, up to the following commit:
d3d0f0a209ee29cf553b8b5580eb954b0d4aa970

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1746806

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

To test this kernel, install both the linux-image and linux-image-extra .deb packages.

Thanks in advance
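
The estimate of about 5 test kernels follows from a bisect halving the candidate range with each test. A rough sketch of that arithmetic, assuming a hypothetical `bisect_steps` helper; the commit counts are illustrative only, since the thread does not say how many commits separate the two tags.

```shell
#!/bin/sh
# Sketch: worst-case number of test kernels a bisect needs for n
# candidate commits. bisect_steps() is a hypothetical helper and the
# counts below are examples, not figures from this bug.
bisect_steps() {
    n=$1
    steps=0
    # Each test kernel halves the remaining range (rounding up).
    while [ "$n" -gt 1 ]; do
        n=$(( (n + 1) / 2 ))
        steps=$(( steps + 1 ))
    done
    echo "$steps"
}

for count in 16 20 32; do
    echo "$count commits -> $(bisect_steps "$count") test kernels"
done
# prints:
# 16 commits -> 4 test kernels
# 20 commits -> 5 test kernels
# 32 commits -> 5 test kernels
```

Under this model, "about 5" is consistent with a range of roughly 17 to 32 commits.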

Paul Friel (pfriel) wrote :

@jsalisbury: I installed that kernel and rebooted using a c5.xl, it froze. I booted into a c4.xl and it booted fine, disabled the apparmor service and rebooted into a c5.xl and it booted fine. Re-enabled apparmor and rebooted into the c5.xl again and it froze on boot.

Given that the c5 instances are Skylake, and I read somewhere that there’s some special edge-case for Skylake and later CPUs, is this reproducible on proper hardware, or is it limited to KVM? (Prior to AWS’s c5s, the hypervisors were Xen.)

I’ve got a couple of Skylake machines available, and will install sssd and see what happens.

Cheers,
David


Joseph Salisbury (jsalisbury) wrote :

I built the next test kernel, up to the following commit:
7de295e2a47849488acec80fc7c9973a4dca204e

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1746806

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Paul Friel (pfriel) wrote :

@jsalisbury: I tested the kernel you provided above (commit 7de295e2a47849488acec80fc7c9973a4dca204e) and it boots fine on both a c5.xl and a m5.xl.

Kamal Mostafa (kamalmostafa) wrote :

Here's a test kernel which incorporates the 'retpoline' patch set (soon to be released in Xenial):

http://kernel.ubuntu.com/~kamal/linux-aws-rtp0/

(only the linux-image .deb is required).

My smoke test indicates that this kernel fixes the problem. Please test, and provide feedback here.

Paul Friel (pfriel) wrote :

@kamalmostafa I installed the rtp0 kernel and verified it boots fine using c5.xl and m5.xl instances with apparmor & sssd enabled.

Changed in linux (Ubuntu):
assignee: Joseph Salisbury (jsalisbury) → Kamal Mostafa (kamalmostafa)
Changed in linux-aws (Ubuntu):
assignee: Joseph Salisbury (jsalisbury) → Kamal Mostafa (kamalmostafa)
no longer affects: sssd (Ubuntu)
no longer affects: sssd (Ubuntu Xenial)
Changed in linux-aws (Ubuntu Xenial):
assignee: nobody → Kamal Mostafa (kamalmostafa)
status: New → In Progress
Changed in linux (Ubuntu Xenial):
assignee: nobody → Kamal Mostafa (kamalmostafa)
status: New → In Progress
Kamal Mostafa (kamalmostafa) wrote :

Thanks for the confirmation Paul, and thanks to all for the assistance with testing and isolation.

We will be deploying this fix (the 'retpoline' patch set) in the next version of the linux-aws kernel (4.4.0-1051.60) and the Xenial generic kernel (4.4.0-113.136).

We'll post here once those builds land in the -proposed archive.

Paul Friel (pfriel) wrote :

@kamalmostafa: Do you all have a target date for when the new linux-aws kernel (4.4.0-1051.60) will be released?

Thanks everyone for your help in quickly tracking down this issue.

Kamal Mostafa (kamalmostafa) wrote :

@Paul, I cannot provide an exact date, but it will most likely be within the next five days.

For the record, we've decided to pull in an additional group of patches, so the next linux-aws kernel release will now be version 4.4.0-1052.61. I'll post here once an official build of that becomes available for verification.

Sam (chatterjee2988) wrote :

@kamalmostafa: Has the latest kernel been released? Your assistance is appreciated.

Changed in linux (Ubuntu Xenial):
status: In Progress → Fix Committed
Changed in linux-aws (Ubuntu Xenial):
status: In Progress → Fix Committed
Kamal Mostafa (kamalmostafa) wrote :

The linux-aws (4.4.0-1052.xx) release has been delayed by the merge of additional necessary retpoline fixes. The packages have now been built and are available in the -proposed archive; current ETA for final release of this kernel is early next week. In the meantime, the .deb package from -proposed is available here:

http://launchpadlibrarian.net/356804492/linux-image-4.4.0-1052-aws_4.4.0-1052.61_amd64.deb

Folks affected by this bug: Please verify that this kernel still resolves the problem and post your results here. Thanks for your patience!

tags: removed: kernel-key pti
Eric Heydrick (eheydrick) wrote :

I tried out the 1052 proposed kernel on a C5 instance that runs sssd/apparmor and it did not lock up.

Changed in cloud-images:
status: New → In Progress
importance: Undecided → High
Paul Friel (pfriel) wrote :

@kamalmostafa I installed that kernel package and it worked fine with sssd running on a c5.xl

Kamal Mostafa (kamalmostafa) wrote :

The fix for this issue has been released in the linux-aws and standard Ubuntu kernels. Fixed versions are:

linux-aws (4.4.0-1052.61) xenial
linux-aws (4.4.0-1014.14) trusty
linux (4.4.0-116.140) xenial
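
To check a kernel package version against one of the versions listed above, something like the following could be used. It is only a sketch: `version_ge` is a hypothetical helper built on GNU `sort -V`, and on Ubuntu `dpkg --compare-versions` is the authoritative comparison. It only compares version numbers; later comments in this bug report lockups on 4.4.0-1052 as well, so being at or above the listed version is not proof that a machine is unaffected.

```shell
#!/bin/sh
# Sketch: check whether a kernel package version is at least a listed
# "fixed" version. version_ge() is a hypothetical helper built on
# GNU sort -V.
version_ge() {
    # Succeeds when $1 >= $2 in version order.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

FIXED="4.4.0-1052.61"   # linux-aws on xenial, from the list above

for v in 4.4.0-1047.56 4.4.0-1048.57 4.4.0-1052.61; do
    if version_ge "$v" "$FIXED"; then
        echo "$v: at or above $FIXED"
    else
        echo "$v: older than $FIXED"
    fi
done
```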

Changed in linux (Ubuntu):
status: Confirmed → Fix Released
Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
Changed in linux-aws (Ubuntu):
status: Confirmed → Fix Released
Changed in linux-aws (Ubuntu Xenial):
status: Fix Committed → Fix Released
Changed in cloud-images:
status: In Progress → Fix Released

I'm still experiencing this issue in eu-west-1 using an m5.large - machine will not boot.

Kernel is 4.4.0-1052-aws
sssd is 1.13.4
apparmor is 2.10.95

Paul Friel (pfriel) wrote :

@davidjmemmett I tested the new kernel on one c5.xl yesterday and it worked fine. Deployed the new kernel to all of our environments today and we are seeing intermittent repro of the same behavior we saw in the past (box fails to boot, no SSH available, CPU at 100%).

We reverted to the 20180109 Ubuntu AMI (kernel 4.4.0-1047.56) and it is working fine for us again.

Robie Basak (racb) wrote :

Apparently not fixed (according to original reporter), so reopening.

Changed in linux (Ubuntu):
status: Fix Released → Confirmed
Changed in linux (Ubuntu Xenial):
status: Fix Released → Confirmed
Changed in linux-aws (Ubuntu Xenial):
status: Fix Released → Confirmed
Changed in linux-aws (Ubuntu):
status: Fix Released → Confirmed
Changed in linux (Ubuntu):
status: Confirmed → In Progress
Changed in linux (Ubuntu Xenial):
status: Confirmed → In Progress
Changed in linux-aws (Ubuntu):
status: Confirmed → In Progress
Changed in linux-aws (Ubuntu Xenial):
status: Confirmed → In Progress
Jon Bishop (bishopjon) wrote :

I am having the same issue. I tested the C5 instance when it first came out and it would randomly lock up at 100% CPU. I contacted AWS support and they could not figure it out; they sent a message to the devs to see if they could. So I switched back to c4 instances. I decided to try it again yesterday and it was working fine until this morning, when it randomly locked up at 100% CPU again. I looked at all the logs and nothing shows what could be causing the issue. I tried multiple kernels and, for me, they all seem to do it. Has anyone figured this out? I would really like to use the C5.

Jon Bishop (bishopjon) wrote :

If it helps, I am using KernelCare, which pushes me to the latest version; I am now seeing that this might not be a good idea.

David JM Emmett (davidjmemmett) wrote :

This issue is not limited to EC2. I'm running artful on a Dell Optiplex 5040 with an i5-6500 CPU (which is Skylake); I've just installed sssd and hit the full system freeze. Even REISUB didn't work!

kernel 4.13.0-36-generic
sssd 1.15.3-2ubuntu1
apparmor 2.11.0-2ubuntu17.1

Kamal Mostafa (kamalmostafa) wrote :

@davidjmemmett- David, could we ask you to attach a copy of your /proc/cpuinfo file from that Optiplex?
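
For anyone else supplying CPU details, the fields of /proc/cpuinfo that matter most here (CPU model and microcode revision) can be summarised with a short script. A minimal sketch, assuming a hypothetical `cpu_summary` helper and the x86 field names "model name" and "microcode"; attaching the full file is still preferable when asked for it.

```shell
#!/bin/sh
# Sketch: summarise the /proc/cpuinfo fields most relevant to this bug.
# cpu_summary() is a hypothetical helper; it reads the file given as $1
# (default /proc/cpuinfo) and prints each distinct model/microcode pair once.
cpu_summary() {
    awk -F': ' '
        /^model name/ { model = $2 }
        /^microcode/  { micro = $2 }
        # A blank line ends one processor entry.
        /^$/ && model != "" {
            seen[model " | microcode " micro] = 1
            model = ""; micro = ""
        }
        END {
            if (model != "") seen[model " | microcode " micro] = 1
            for (k in seen) print k
        }
    ' "${1:-/proc/cpuinfo}"
}

if [ -r /proc/cpuinfo ]; then
    cpu_summary
fi
```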

Kamal Mostafa (kamalmostafa) wrote :

Folks who can reproduce this reliably -- Please try this new 4.4.0-117.141 kernel:

https://launchpad.net/~canonical-kernel-team/+archive/ubuntu/ppa/+build/14450934
  ( linux-image-4.4.0-117-generic_4.4.0-117.141_amd64.deb )

@kamalmostafa please find attached

@kamalmostafa - I have installed the suggested linux-image-4.4.0-117-generic_4.4.0-117.141_amd64.deb and launched multiple M5 and C5 instances (more than 40) across multiple availability zones in N. Virginia, and unfortunately the issue is still happening with this kernel.

David Hunter (davidhun) wrote :

@kamalmostafa - Adding another datapoint: installed 4.4.0-117.141 on a vm guest on ESXi 6.0 on a host running Intel Xeon CPU E5-2695 v4 (Broadwell) and the issue still occurs. Disabling the apparmor profile for sssd allows boot to complete and restores sssd functionality.

Nivritti (wildstalker) wrote :

I'm seeing this issue on Ubuntu Server 16.04.3 LTS in VMs under Hyper-V.
sssd is 1.13.4-1ubuntu1.10 amd64
kernel 4.4.0-87-generic x86_64 : Stable
kernel 4.4.0-116-generic x86_64 : Unstable

Crashes occur randomly during boot after completing a realm join on a stable kernel, or while performing a realm join on a newer kernel.

I observed the same hard-lock behavior with 16.04 HWE as well as 17.10.

Sibasankar (sibbeher) on 2018-04-01
Changed in linux-aws (Ubuntu):
status: In Progress → Confirmed
Rob Johnson (rob.johnson) wrote :

This bug seems to be the closest match to what I'm experiencing. I'm hitting similar trouble on desktop/graphical Ubuntu on some Dell laptops.

Ubuntu 16.04 AMD64.

Happening on kernel versions 4.13.0-38 and 4.4.0-116.

I've tried version 4.4.0-117 as posted by @kamalmostafa - same result.

The laptops are an XPS 13 and a Precision M3520. Both machines have Kaby Lake/7th gen Core i7 CPUs.

The problem on the 4.4.0-116 kernel (on my M3520) occurred on the first reboot after updating the Intel microcode. I'm starting to think this is related to the Meltdown/Spectre patching.

I started sssd in the foreground with debug set to 9 on the M3520 and netcat'd the results to another machine, so I think I have it right up until the laptop stops responding. It's obviously quite verbose, but if someone thinks it'll help, I'll post it.
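
Capturing sssd debug output on a second machine, as described above, might look roughly like this. It is a sketch only: the address 192.0.2.10 and port 9999 are placeholders, `print_capture_cmds` is a hypothetical helper that just prints the two commands, `nc -l PORT` is the netcat-openbsd listener syntax (other netcat variants differ), and it assumes sssd writes its debug output to stderr when run with -i.

```shell
#!/bin/sh
# Sketch: print the two commands for streaming sssd debug output to
# another machine. Nothing is executed; host and port are placeholders.
print_capture_cmds() {
    host=$1
    port=$2
    # On the collecting machine: listen and save whatever arrives.
    echo "collector: nc -l $port > sssd-debug.log"
    # On the affected machine: run sssd in the foreground (-i) at
    # debug level 9 (-d 9) and pipe the output across the network.
    echo "target:    sudo sssd -i -d 9 2>&1 | nc $host $port"
}

print_capture_cmds 192.0.2.10 9999
```

Sending the output off the machine matters here because local log files tend to be incomplete after a hard freeze.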

From talking to some of the people on #ubuntu-server, the microcode seems increasingly likely to be the root cause. Bug #1759920 may include relevant information.

Claus Stovgaard (frosteyes) wrote :

I can confirm the suspicion that this issue relates to Meltdown/Spectre patching.

Running an Ubuntu 16.04 desktop in a virtual machine (VMware Workstation 12.5.9) on top of Windows 7.
The hardware is a Dell T1700 with an i7-4770, and it seems the issue arose after a BIOS and microcode update.

That's good news!

I have tested both kernels in multiple C5/M5 instances (more than 80) with SSSD and AppArmor enabled in N. Virginia, and I did not see a single failure. It seems that this new kernel version fixed the issue.

It would be good to get more feedback and push this kernel release as soon as possible.

Kamal Mostafa (kamalmostafa) wrote :

Thanks @walterzjunior. Those kernel versions linked in comment #50 are proceeding towards release regardless, but additional test results from other people affected by this bug would be very much appreciated.

Kamal Mostafa (kamalmostafa) wrote :

Marking this bug as a duplicate of bug #1759920, where deployment of the fix will be tracked.
