[4.8 regression] boot has become very slow

Bug #1626436 reported by Martin Pitt on 2016-09-22
This bug affects 13 people
Affects                  Importance   Assigned to
linux (Ubuntu)           High         Joseph Salisbury
linux (Ubuntu) Yakkety   High         Joseph Salisbury

Bug Description

With yakkety's recent update from linux 4.4 to 4.8, booting has become a lot slower. It's not one service in particular; without "quiet" and "splash" you can now easily read every single line instead of the whole wall of text zipping by. Boot now takes over 20 seconds instead of ~10.

This is even more dramatic when factoring out the recent boot hang of NetworkManager (bug 1622893) and disabling lightdm:

  sudo systemctl mask NetworkManager NetworkManager-wait-online lightdm

then booting with 4.4 takes 1.5s and with 4.8 19.5s (!).

Some excerpts from systemd-analyze blame:

4.4:
           474ms postfix@-.service
           395ms lxd-containers.service
           305ms networking.service

4.8:
          4.578s postfix@-.service
          7.300s lxd-containers.service
          6.285s networking.service

I attach the full outputs of critical-chain and analyze for 4.4 and 4.8 for reference.
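The per-unit slowdown can be quantified by diffing two such blame dumps. A minimal sketch, seeded with the numbers quoted above (on a real system you would capture each file with "systemd-analyze blame > blame-4.x.txt" after booting the respective kernel; the file names are placeholders):

```shell
# Seed the two dumps with the excerpts from this report.
cat > blame-4.4.txt <<'EOF'
474ms postfix@-.service
395ms lxd-containers.service
305ms networking.service
EOF
cat > blame-4.8.txt <<'EOF'
4.578s postfix@-.service
7.300s lxd-containers.service
6.285s networking.service
EOF

# Normalize "474ms" / "4.578s" to milliseconds and sort by unit name.
to_ms() {
  awk '{ t = $1
         if (t ~ /ms$/) { sub(/ms$/, "", t); ms = t }
         else           { sub(/s$/,  "", t); ms = t * 1000 }
         print $2, ms }' "$1" | sort
}

to_ms blame-4.4.txt > blame-old.txt
to_ms blame-4.8.txt > blame-new.txt

# Join on the unit name and print the slowdown factor per unit.
join blame-old.txt blame-new.txt |
  awk '{ printf "%-26s %6dms -> %6dms  (%.1fx slower)\n", $1, $2, $3, $3 / $2 }'
```

On these three units the factor comes out between roughly 10x and 20x, which matches the overall 1.5s-to-19.5s regression.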

This is much less noticeable in the running system. There is no immediate feeling of sluggishness (although my system is by and large idle).

I compared the time of building colord with sbuild under similar circumstances (-j4, building on tmpfs, thus no hard disk delays; running with a fully pre-loaded apt-cacher-ng, thus no random network delays): with 4.4 it takes 6.5 minutes and with 4.8 it takes 7.5 minutes. So that got a bit slower, but much less dramatically than boot, so this is either happening when a lot of processes run in parallel, or is perhaps related to setting up cgroups.

One thing I noticed is that during sbuild, "top" shows ridiculous load averages (~250) under 4.8, while it's around 4 or 5 under 4.4. But that doesn't reflect in actual sluggishness, so this might just be an unrelated bug.

ProblemType: Bug
DistroRelease: Ubuntu 16.10
Package: linux-image-4.8.0-11-generic 4.8.0-11.12
ProcVersionSignature: Ubuntu 4.8.0-11.12-generic 4.8.0-rc6
Uname: Linux 4.8.0-11-generic x86_64
ApportVersion: 2.20.3-0ubuntu7
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/pcmC0D0c: martin 3049 F...m pulseaudio
 /dev/snd/pcmC0D0p: martin 3049 F...m pulseaudio
 /dev/snd/controlC0: martin 3049 F.... pulseaudio
Date: Thu Sep 22 09:42:56 2016
EcryptfsInUse: Yes
MachineType: LENOVO 2324CTO
ProcEnviron:
 TERM=linux
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=de_DE.UTF-8
 SHELL=/bin/bash
ProcFB: 0 inteldrmfb
ProcKernelCmdLine: BOOT_IMAGE=/@/boot/vmlinuz-4.8.0-11-generic.efi.signed root=UUID=f86539b0-3a1b-4372-83b0-acdd029ade68 ro rootflags=subvol=@ systemd.debug-shell
RelatedPackageVersions:
 linux-restricted-modules-4.8.0-11-generic N/A
 linux-backports-modules-4.8.0-11-generic N/A
 linux-firmware 1.161
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 07/09/2013
dmi.bios.vendor: LENOVO
dmi.bios.version: G2ET95WW (2.55 )
dmi.board.asset.tag: Not Available
dmi.board.name: 2324CTO
dmi.board.vendor: LENOVO
dmi.board.version: 0B98401 Pro
dmi.chassis.asset.tag: No Asset Information
dmi.chassis.type: 10
dmi.chassis.vendor: LENOVO
dmi.chassis.version: Not Available
dmi.modalias: dmi:bvnLENOVO:bvrG2ET95WW(2.55):bd07/09/2013:svnLENOVO:pn2324CTO:pvrThinkPadX230:rvnLENOVO:rn2324CTO:rvr0B98401Pro:cvnLENOVO:ct10:cvrNotAvailable:
dmi.product.name: 2324CTO
dmi.product.version: ThinkPad X230
dmi.sys.vendor: LENOVO

Martin Pitt (pitti) wrote :
tags: added: bootspeed regression-release
Andy Whitcroft (apw) on 2016-09-22
tags: added: kernel-4.8
Martin Pitt (pitti) wrote :

Confirmed with 4.8.0-14.15-generic from yakkety-proposed.

Immediately after boot the system load is also reported as ~ 250, and then slowly goes down towards zero.

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Martin Pitt (pitti) wrote :

I do not see this in QEMU (even with -smp 4), only on my laptop. At this point I'm not sure if it is specific to my ThinkPad X230 or happens for others too.

Changed in linux (Ubuntu):
importance: Undecided → Medium
tags: added: kernel-da-key
Daniele Dellafiore (ildella) wrote :

Happens to me as well, on an XPS 13 (2013). Ubuntu always used to load before the "second dot" became orange; now it fills up 15 dots before loading the login screen. This started after upgrading 4.4 -> 4.8.
Everything else works ok.

Martin Pitt (pitti) wrote :

For the record, Andy already has/knows the fix for this.

Changed in linux (Ubuntu):
status: Confirmed → In Progress
Martin Pitt (pitti) wrote :

As far as I understand Andy, bug 1626564 is the root cause, thus duping.

Martin Pitt (pitti) wrote :

Unduplicating. This still happens with 4.8.0-17 from the PPA, both the slow boots and the high load. It got a fair bit better (load of ~35 instead of ~250), but it is still a huge regression compared to 4.4.

Changed in linux (Ubuntu):
status: In Progress → Triaged
Patricia Gaughen (gaughen) wrote :

I clicked the "affects me too" button at the top, but just wanted to add that we're seeing slow boots on EC2 also. With the latest yakkety AMI on EC2 (ami-1bc5820c, us-east-1, m4.large), first and subsequent boots are much slower than expected.

Top 5 in the blame:
$ systemd-analyze blame
     2min 6.384s cloud-init.service
         14.945s cloud-init-local.service
          5.595s keyboard-setup.service
          4.079s dev-xvda1.device
          2.587s cloud-config.service

Martin Pitt (pitti) wrote :

Still confirmed with yesterday's 4.8.0-19.21.

Martin Pitt (pitti) wrote :

For the record, I tried Joseph's test bisect kernel in bug 1627108, and this does not fix it, so this is not a duplicate of bug 1627108:

4.4.0-9136 (previous yakkety): 2.849s (kernel) + 4.639s (userspace)
4.8.0-19 (current yakkety): 5.145s (kernel) + 11.825s (userspace)
4.7.0-040700rc3 (from Joseph): 3.735s (kernel) + 14.810s (userspace)

Martin Pitt (pitti) wrote :

This could potentially be related to changes with cgroups -- creating them did not cause any uevent in 4.4, but with 4.8 they do:

  sudo systemctl stop systemd-timesyncd
  udevadm monitor -k # in another terminal
  sudo systemctl start systemd-timesyncd

KERNEL[393.260769] add /kernel/slab/:atA-0000192/cgroup/dentry(1623:systemd-timesyncd.service) (cgroup)
KERNEL[393.261031] add /kernel/slab/inode_cache/cgroup/inode_cache(1623:systemd-timesyncd.service) (cgroup)
KERNEL[393.261850] add /kernel/slab/shmem_inode_cache/cgroup/shmem_inode_cache(1623:systemd-timesyncd.service) (cgroup)
KERNEL[393.262358] add /kernel/slab/:tA-0000192/cgroup/cred_jar(1623:systemd-timesyncd.service) (cgroup)
KERNEL[393.262636] add /kernel/slab/proc_inode_cache/cgroup/proc_inode_cache(1623:systemd-timesyncd.service) (cgroup)
KERNEL[393.452990] add /kernel/slab/:tA-0001024/cgroup/mm_struct(1623:systemd-timesyncd.service) (cgroup)
KERNEL[393.453082] add /kernel/slab/:tA-0000200/cgroup/vm_area_struct(1623:systemd-timesyncd.service) (cgroup)
KERNEL[393.453251] add /kernel/slab/:tA-0000064/cgroup/anon_vma_chain(1623:systemd-timesyncd.service) (cgroup)
KERNEL[393.453369] add /kernel/slab/anon_vma/cgroup/anon_vma(1623:systemd-timesyncd.service) (cgroup)
KERNEL[393.456909] add /kernel/slab/sock_inode_cache/cgroup/sock_inode_cache(1623:systemd-timesyncd.service) (cgroup)
KERNEL[393.457974] add /kernel/slab/:t-0000256/cgroup/kmalloc-256(1623:systemd-timesyncd.service) (cgroup)
KERNEL[393.458205] add /kernel/slab/:t-0000512/cgroup/kmalloc-512(1623:systemd-timesyncd.service) (cgroup)
KERNEL[393.460718] add /kernel/slab/:tA-0003648/cgroup/task_struct(1623:systemd-timesyncd.service) (cgroup)
KERNEL[393.462292] add /kernel/slab/:tA-0000128/cgroup/pid(1623:systemd-timesyncd.service) (cgroup)
KERNEL[393.465448] add /kernel/slab/:t-0001024/cgroup/kmalloc-1024(1623:systemd-timesyncd.service) (cgroup)

Which means that there will be a corresponding number of udev workers running. At boot there is a *lot* of cgroup action due to lots of services being started, while this is relatively quiet during runtime. This is consistent with a slow boot but normal behaviour/feeling at runtime.
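To check this theory on a capture, the uevents can be tallied by subsystem. A sketch, with uevents.log seeded from three of the lines quoted above (on a live system the log would come from "udevadm monitor -k" running in another terminal while a service starts):

```shell
cat > uevents.log <<'EOF'
KERNEL[393.260769] add /kernel/slab/:atA-0000192/cgroup/dentry(1623:systemd-timesyncd.service) (cgroup)
KERNEL[393.261031] add /kernel/slab/inode_cache/cgroup/inode_cache(1623:systemd-timesyncd.service) (cgroup)
KERNEL[393.261850] add /kernel/slab/shmem_inode_cache/cgroup/shmem_inode_cache(1623:systemd-timesyncd.service) (cgroup)
EOF

# Count events per subsystem (the parenthesized last field of each line).
awk '{ s = $NF; gsub(/[()]/, "", s); count[s]++ }
     END { for (k in count) print count[k], k }' uevents.log
```

During boot, with dozens of services starting, a per-subsystem count like this would show how much of the uevent volume is cgroup traffic.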

Florian W. (florian-will) wrote :

Same here; downgrading to 4.4 from Xenial improved the situation (boot still feels a little slower than Xenial, though, maybe because of the NetworkManager bug). I use a seven-year-old desktop computer based on a Gigabyte motherboard and an AMD Phenom II X4 955 CPU, so it's not ThinkPad-related.

(From GRUB to lightdm used to be 7 seconds in 16.04, but with kernel 4.8 and the new NetworkManager it feels more like 20 seconds. Also, the Unity 7 + Chromium launch used to take just a few seconds; now there's a noticeable delay with only the wallpaper showing, no matter which kernel I use, so that's unrelated. Anyway, booting my computer has just become a lot less fun even though I removed Apache 2 and Tomcat. :( )

Martin Pitt (pitti) wrote :

Confirmed with 4.8.0-040800.201610022031 from http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.8/

Florian W. (florian-will) wrote :

This is what I get right after boot and login into Unity 7 (executed in Gnome Terminal):
$ ps aux | grep kworker | wc -l
1004

Kernel is 4.8.0-22-generic. After a few minutes, only 21 kworkers remain. Load after 4 minutes is not excessive (1244 sleeping processes), forgot to check load right after boot:
up 4 min, 1 user, load average: 1,01, 1,79, 0,94

I'm not sure if it is of any help, but I've attached the output of "ps aux | grep kworker".

Martin Pitt (pitti) on 2016-10-20
description: updated
Joseph Salisbury (jsalisbury) wrote :

I'd like to perform a bisect to figure out what commit caused this regression. We need to identify the earliest kernel where the issue started happening as well as the latest kernel that did not have this issue.

Can you test the following kernels and report back? We are looking for the first kernel version that exhibits this bug:

4.5 final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.5-wily/
4.6 final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.6-yakkety/
4.7 final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.7/
4.8-rc1: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.8-rc1/

You don't have to test every kernel, just up until the kernel that first has this bug.

Thanks in advance!

Changed in linux (Ubuntu Yakkety):
importance: Undecided → Medium
status: New → Triaged
tags: added: performing-bisect
Joseph Salisbury (jsalisbury) wrote :

Thanks for testing. We need to narrow it down a bit more by testing some of the v4.7 release candidates. Can you test v4.7-rc1? It can be downloaded from:

4.7-rc1: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.7-rc1-yakkety/

Martin Pitt (pitti) wrote :

> 4.7-rc1: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.7-rc1-yakkety/

bad

Thanks Joseph for taking this!

Florian W. (florian-will) wrote :

My problem is probably the same one that Doug Smythies described in bug 1626564. I used the "echo workqueue:workqueue_queue_work > /sys/kernel/debug/tracing/set_event" method and saw about 1000 work items enqueued by "memcg_kmem_cache_create_func" at pretty much the same time, which correlates nicely with the 1000 kworker processes I see after boot.

If Martin has the same issue with lots of kworkers (I'm still not quite sure?), this might be the same issue.

So I suspect the bad commit might be:
81ae6d03952c1bfb96e1a716809bd65e7cd14360
"mm/slub.c: replace kick_all_cpus_sync() with synchronize_sched() in kmem_cache_shrink()"
according to https://bugzilla.kernel.org/show_bug.cgi?id=172991

This commit was released in Linux 4.7.

This patch might fix the issue, but I haven't tested it:
https://patchwork.kernel.org/patch/9359271/
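The counting step of the set_event method above can be sketched as follows. On a live system the log would come from the workqueue:workqueue_queue_work tracepoint ("cat /sys/kernel/debug/tracing/trace > wq.log" after enabling it, as root); the sample lines below are invented in the tracepoint's general shape (only the "function=" field matters for the tally):

```shell
cat > wq.log <<'EOF'
systemd-1 [000] 10.000001: workqueue_queue_work: work struct=A function=memcg_kmem_cache_create_func
systemd-1 [000] 10.000002: workqueue_queue_work: work struct=B function=memcg_kmem_cache_create_func
kworker/0:1-42 [000] 10.000003: workqueue_queue_work: work struct=C function=vmstat_update
EOF

# Tally queued work items by callback function, most frequent first.
awk -F'function=' 'NF > 1 { count[$2]++ }
                   END { for (f in count) print count[f], f }' wq.log | sort -rn
```

On a capture like Florian's, memcg_kmem_cache_create_func would dominate the top of this list by a wide margin.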

Florian W. (florian-will) wrote :

Well, I guess the other patch is required as well for SLUB (not only SLAB as I assumed at first):
https://patchwork.kernel.org/patch/9361853/

Joseph Salisbury (jsalisbury) wrote :

I'll build a test kernel with those two patches and post a link to it shortly.

Joseph Salisbury (jsalisbury) wrote :

I built a Yakkety test kernel with those two patches. It can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1626436/

Can you test this kernel and see if it resolves this bug?

Martin Pitt (pitti) wrote :

Thanks Florian and Joseph! With the two patches it's definitely a lot better! Boot times without NM and lightdm:

4.4: 1.5s
4.8.0-22 (zesty): 16.8s
4.8.0-26 (Joseph's kernel): 8.4s

So still quite far from what we used to have, but already twice as fast as the current y/z kernel.

Florian W. (florian-will) wrote :

Thanks Joseph! I'm seeing the same effect described by Martin. Boot+lightdm+loading Unity 7 is not as fast as it used to be in 16.04.

I need to compare timings (4.8 patched vs 4.8 vs 4.4) tomorrow to make sure, but I also feel like it's a little faster now with the patches applied.

The number of kworker processes after boot is drastically reduced with the patched kernel; it's a sane number now (~30), not 1004 as with the unpatched 4.8 kernel.

Florian W. (florian-will) wrote :

My boot time comparison: (Wall clock time from GRUB to lightdm ready for login)
4.4: 12.28s
4.8.0-26 + Patches (Joseph's kernel): 21.78s
4.8.0-22 (Yakkety): 25.06s

This is with NetworkManager-wait-online.service enabled, which seems to take ~8s in my case.

The attached file has systemd-analyze (blame|critical-chain) output for each boot. Seems like slowdowns are all over the place, with some exceptions like nmbd. nmbd takes 5.1s no matter which kernel I use.

And apparmor.service behaves opposite to boot time in this case: When boot is slower, apparmor.service loads faster.

Loading times after typing my password into lightdm appear to correlate with boot times, but I haven't measured them. Noticeably slower with 4.8 compared to 4.4.

I don't have enough time to learn how to compile Ubuntu kernel packages and start a bisection right now, but maybe I can start an attempt next week. My CPU is old / slow though, so that will probably take some time.

Scott Moser (smoser) wrote :

So, as another datapoint, I have built the same package with sbuild on 4.4 and 4.8.

I grabbed 4.4 from xenial, rebooted into it, and built the same trivial package that I built yesterday on 4.8:
$ ( cd /home/smoser/ubuntu/logs/ && grep "^Build needed" smello_0.4~ppa1_amd64-2016-10-2*)
smello_0.4~ppa1_amd64-2016-10-20T16:48:33Z.build:Build needed 00:02:34, 120k disc space
smello_0.4~ppa1_amd64-2016-10-21T18:31:17Z.build:Build needed 00:00:32, 120k disc space

So the difference is that it took 32 seconds to build the package on 4.4 and 154 seconds on 4.8. I've not done any more investigation into what exactly is slow; load does show high on 4.8 during an sbuild, and this isn't a matter of a download taking longer.

Joseph Salisbury (jsalisbury) wrote :

@Florian, I can build the test kernels for the bisect. Per comments #17 and #19, we know that 4.6 is good and 4.7-rc1 is bad, so I'll bisect between those two versions. I'll build the first test kernel and post it shortly.

Changed in linux (Ubuntu):
importance: Medium → High
Changed in linux (Ubuntu Yakkety):
importance: Medium → High
Changed in linux (Ubuntu):
status: Triaged → In Progress
Changed in linux (Ubuntu Yakkety):
status: Triaged → In Progress
Changed in linux (Ubuntu):
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Yakkety):
assignee: nobody → Joseph Salisbury (jsalisbury)
Joseph Salisbury (jsalisbury) wrote :

I started a kernel bisect between v4.6 final and v4.7-rc1. The kernel bisect will require testing of about 7-10 test kernels.

I built the first test kernel, up to the following commit:
7afd16f882887c9adc69cd1794f5e57777723217

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1626436

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Florian W. (florian-will) wrote :

Thanks Joseph!

I now believe I'm seeing two different issues in 4.8 that are not in 4.4 and slow down my boot, which is very confusing.
- #1: High number of kworkers (fixed in 4.8 using the two patches and bisected by Doug Smythies already)
- #2: Unknown other issue. I've noticed that when compiling a simple cmake/ninja project using 6 compiler processes on my rather old 4-core CPU, some kernels show odd behaviour that might also slow down the boot process.

When compiling that cmake project on 4.4, 4.6-7afd16f or 4.7-rc1, issue #2 does not exist. While compilation takes place on 4.8 (no matter if kworker patches are applied or not), my mouse cursor becomes sluggish and it takes a long time to just switch to another window in Unity 7. Compilation speed is unaffected though, so maybe just some kind of task switching problem, not a CPU utilization issue.

So that makes it quite difficult to tell if any given kernel is good or bad by just looking at the boot time. As far as I can tell, this is the current list of kernels and the issues they have, where #2 was detected using the cmake compilation behaviour and #1 by the number of kworker processes after boot (should be ~35):

4.8 + patches: #2
4.8: #1, #2
4.7-rc1: #1
4.6-7afd16f: none
4.4: none

4.8 has about the same wall clock boot time as 4.7-rc1 in my case, while 4.8+patches is ~5 seconds faster, but that's hard to tell, since boot times vary by 2-3 seconds anyway. So #1 apparently dominates the slow down. I've only been able to get a "perfect" <10sec boot time from GRUB to lightdm with 4.4 and 4.6-7afd16f. So the order of boot speed is like this right now for me:

4.4|4.6-7afd16f < 4.8+patches < 4.8|4.7-rc1

Sorry for the long-winded comment, I just feel uncomfortable calling this kernel "good" or "bad" without explaining why 4.7-rc1 might actually be "good" on my system concerning issue #2. 4.6-7afd16f appears to be as "good" as 4.4 in any case.

Joseph Salisbury (jsalisbury) wrote :

Thanks for the update, Florian. Can you also test the kernel posted in comment #29, Martin? If you also see good boot speed, I'll mark that kernel as good and build the next one.

Martin Pitt (pitti) wrote :

> I built the first test kernel, up to the following commit:
> 7afd16f882887c9adc69cd1794f5e57777723217

Like Florian, the result isn't unambiguous. It for sure boots much faster than 4.7/4.8 (~ 3s kernel + 5s userspace), but still much slower than 4.4. But this could be attributed to the #1 issue above. So I *think* this kernel is still "good" (but not quite as clearly as for Florian).

If it isn't too much trouble, would it be possible to apply the two patches on every bisected kernel to make the results more distinct?

Joseph Salisbury (jsalisbury) wrote :

Yes, I'll rebuild that test kernel with the two patches. The second patch does not apply cleanly due to where the bisect currently is in time.

In the meantime, it might be worthwhile to test v4.9-rc2. It's available from:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.9-rc2/

Joseph Salisbury (jsalisbury) wrote :

I built another test kernel, up to the same commit:
7afd16f882887c9adc69cd1794f5e57777723217

This kernel also has the two patches.

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1626436

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Martin Pitt (pitti) wrote :

> In the meantime, it might be worthwhile to test v4.9-rc2. It's available from:
> http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.9-rc2/

Still slow boot/high load, i. e. BAD.

> I built another test kernel, up to the same commit:
> 7afd16f882887c9adc69cd1794f5e57777723217

That's GOOD.

Thanks!

Joseph Salisbury (jsalisbury) wrote :

I built another test kernel, up to the following commit:
a7fd20d1c476af4563e66865213474a2f9f473a4

This kernel also has the two patches.

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1626436/a7fd20d1c476af4563e66865213474a2f9f473a4

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Martin Pitt (pitti) wrote :

> I built another test kernel, up to the following commit: a7fd20d1c476af4563e66865213474a2f9f473a4

This is GOOD.

Joseph Salisbury (jsalisbury) wrote :

I built another test kernel, up to the following commit:
4a5219edcdae52bfb5eea0dfc2a7bd575961dad7

This kernel also has the two patches.

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1626436/4a5219edcdae52bfb5eea0dfc2a7bd575961dad7

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Joseph Salisbury (jsalisbury) wrote :

Can you see the files now?

Martin Pitt (pitti) wrote :

> http://kernel.ubuntu.com/~jsalisbury/lp1626436/4a5219edcdae52bfb5eea0dfc2a7bd575961dad7

I do see the files now. Sorry for the delay, travel/sprint and all.

This is a GOOD kernel.

Joseph Salisbury (jsalisbury) wrote :

I built the next test kernel, up to the following commit:
0efacbbaee1e94e9942da0912f5b46ffd45a74bd

This kernel also has the two patches.

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1626436/0efacbbaee1e94e9942da0912f5b46ffd45a74bd

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Martin Pitt (pitti) wrote :

> The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1626436/0efacbbaee1e94e9942da0912f5b46ffd45a74bd

This is a GOOD kernel.

Joseph Salisbury (jsalisbury) wrote :

I built the next test kernel, up to the commit:
07b75260ebc2c789724c594d7eaf0194fa47b3be

This kernel also has the two patches.

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1626436/07b75260ebc2c789724c594d7eaf0194fa47b3be

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Joseph Salisbury (jsalisbury) wrote :

I built the next test kernel, up to the commit:
a0d3c7c5c07cfbe00ab89438ddf82482f5a99422

This kernel also has the two patches.

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1626436/a0d3c7c5c07cfbe00ab89438ddf82482f5a99422

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Martin Pitt (pitti) wrote :

> http://kernel.ubuntu.com/~jsalisbury/lp1626436/a0d3c7c5c07cfbe00ab89438ddf82482f5a99422

Still GOOD.

I'm becoming a bit nervous, but I just re-tested 4.7rc1 again and it's still bad.

Joseph Salisbury (jsalisbury) wrote :

I built the next test kernel, up to the commit:
1acd010152138644f63d743e165161edc780fc32

This kernel also has the two patches.

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1626436/1acd010152138644f63d743e165161edc780fc32

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Joseph Salisbury (jsalisbury) wrote :

I built the next test kernel, up to the commit:
58f8b094e96f12e899bf767fc658c165908065d4

This kernel also has the two patches.

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1626436/58f8b094e96f12e899bf767fc658c165908065d4

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Joseph Salisbury (jsalisbury) wrote :

I built the next test kernel, up to the commit:
e257ef55ce51d7ec399193ee85acda8b8759d930

This kernel also has the two patches.

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1626436/e257ef55ce51d7ec399193ee85acda8b8759d930

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Joseph Salisbury (jsalisbury) wrote :

I built the next test kernel, up to the commit:
caee57ec71e1a0cb50b6028c706cfe541dcb080a

This kernel also has the two patches.

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1626436/caee57ec71e1a0cb50b6028c706cfe541dcb080a

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Joseph Salisbury (jsalisbury) wrote :

Ignore my last test kernel, I forgot to add the two patches. I'll build a new one.

Joseph Salisbury (jsalisbury) wrote :

Since the mainline kernel is now at v4.9-rc5, it might be worth giving it a test. If it fixes the boot slowness, time would be better spent looking for the fix, versus continuing a search on what broke it.

V4.9-rc5 can be downloaded from:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.9-rc5/

Martin Pitt (pitti) wrote :

> http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.9-rc5/

This still shows unchanged behaviour: slow boot and shutdown, and high load after boot. So unfortunately it didn't magically fix itself :(

Martin Pitt (pitti) wrote :

Is there something which could help to cut down the bisect? I had a theory in comment 12 about the plethora of cgroup related uevents.

Joseph Salisbury (jsalisbury) wrote :

We only have about 4 more kernels to test. We could try making changes to the cgroup code, or even revert some of the cgroup changes, but that would be hit or miss. A bisect will tell us exactly which commit started this, and then we can examine that commit to understand what is going on.

Just as a sanity check, I also built the v4.9-rc5 kernel with the two commits we've been adding for every bisected kernel. Can you test that kernel as well? Then I'll continue with the last few test kernels for the bisect.

That mainline test kernel can be downloaded from:

Martin Pitt (pitti) wrote :

Ah, just 4 sounds fine indeed. I did test 4.9-rc5 yesterday already (see comment #57); I'm just not sure if that includes the two commits.

Joseph Salisbury (jsalisbury) wrote :

The kernel in #57 did not include the patches, but the kernel in comment #59 is current mainline and it DOES have the patches.

I noticed the comment does not have the download link. It is:
http://kernel.ubuntu.com/~jsalisbury/lp1626436/mailine-with-two-commits/

I just wanted to confirm the mainline kernel still exhibits the bug both with and without the patches.

Martin Pitt (pitti) wrote :

> I noticed the comment does not have the download link. It is:
> http://kernel.ubuntu.com/~jsalisbury/lp1626436/mailine-with-two-commits/

Bazinga! Perhaps unexpectedly, this is a GOOD kernel -- boot speed is comparable with 4.4, there is no high load after booting any more, and shutdown is fast as well.

Thanks!

Joseph Salisbury (jsalisbury) wrote :

Yes, that result is unexpected. So it sounds like v4.9-rc5 without the patches is still bad, but adding the two patches fixes the bug.

I built another -proposed Yakkety test kernel with those two patches. It can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1626436/yakkety-proposed-patched

It may be that 4.9-rc5 has additional commits that are also needed. If this Yakkety test kernel is still bad, I'll review what additional commits are in -rc5.

Martin Pitt (pitti) wrote :

> http://kernel.ubuntu.com/~jsalisbury/lp1626436/yakkety-proposed-patched

This is a GOOD kernel as well -- booting is almost as fast as with 4.4, and the load doesn't explode under sbuild either.

I'm a bit confused now, I thought we already tried the two patches in isolation -- but I think *only* against 4.6 and/or 4.7, not against 4.8. But so much the better. ☺

Thanks for your efforts Joseph, much appreciated!

Joseph Salisbury (jsalisbury) wrote :

We did in fact test the yakkety kernel with those two patches on Oct 21, per comments 23 and 24.

However, the test kernel I built with those two patches was 4.8.0-26. The last 4.8-based test kernel I posted was built with the two patches against Yakkety -proposed, which is now at 4.8.0-29.

That would suggest that another fix came in with the stable updates which, in addition to those two patches, fixes the bug.

The next step is for me to check with upstream on the current status of the two patches. I'll see when they are expected to land in mainline, so we can cherry pick them into Yakkety.

I'll investigate further and post an update.

Joseph Salisbury (jsalisbury) wrote :

The patches we were testing have now landed in mainline. Does this bug still happen on Yakkety with the latest updates?

Tomas Janousek (lis) wrote :

Are you guys aware of any other patches that went to mainline and may affect this? I've been hit by this issue since 4.6/4.7-ish days and can still reproduce it with 4.9.7. The two patches do help, but it's far from perfect:

4.9.7:
Startup finished in 3.498s (kernel) + 20.462s (userspace) = 23.961s

4.9.7 + https://github.com/torvalds/linux/commit/89e364db71fb5e7fc8d93228152abfa67daf35fa + https://github.com/torvalds/linux/commit/13583c3d3224508582ec03d881d0b68dd3ee8e10:
Startup finished in 3.518s (kernel) + 13.027s (userspace) = 16.546s

4.4.28:
Startup finished in 2.698s (kernel) + 11.761s (userspace) = 14.459s

There's consistently a ~1 second difference in kernel time and another ~1 second in userspace. It's not a statistical error; I checked. I think it is even more pronounced at shutdown, but I don't have exact measurements for that.

(I don't use a Ubuntu kernel, so I apologize for being slightly off-topic. If you're aware of an upstream issue tracking this, feel free to refer me to it.)

Dimitrij Mijoski (dimztimz) wrote :

I had slow boot with 16.10 a few months ago and moved back to 16.04. Today I had time to test some of the kernels from the bisection process above. These are the results:

4.4.0-59-generic # Ubuntu 16.04 official January 2017
Startup finished in 4.889s (kernel) + 28.001s (userspace) = 32.890s

4.6.0-040600-generic # from comment 29 ... actually from #34
Startup finished in 5.274s (kernel) + 23.561s (userspace) = 28.836s

4.6.0-040600-generic # comment 36
Startup finished in 5.111s (kernel) + 25.192s (userspace) = 30.304s

4.6.0-040600rc1-generic # comment 54
Startup finished in 4.979s (kernel) + 21.904s (userspace) = 26.883s

4.8.0-26-generic # official Ubuntu 16.10 at November 2016
Startup finished in 15.025s (kernel) + 35.067s (userspace) = 50.093s

4.8.0-29-generic # comment 61
Startup finished in 14.808s (kernel) + 32.516s (userspace) = 47.325s
Startup finished in 15.674s (kernel) + 26.293s (userspace) = 41.967s

4.8.0-34-generic # Official 16.04 repo, apt install linux-generic-hwe-16.04-edge
Startup finished in 15.717s (kernel) + 26.583s (userspace) = 42.301s

Basically, the two patches on my (old) laptop reduce the total boot from 50-something to 40-something seconds. Obviously there is another issue. On 4.8 kernels (with or without the two patches) I get a short freeze of 1-2 seconds when I open a System Settings window.

I'd suggest we continue the bisection that Joseph Salisbury started, picking up after the kernel in comment 54, which is GOOD.

Dimitrij Mijoski (dimztimz) wrote :

I decided to test with some of the mainline kernel images posted at http://kernel.ubuntu.com/~kernel-ppa/mainline/

4.6.4
Startup finished in 4.991s (kernel) + 23.504s (userspace) = 28.496s

4.6.6
Startup finished in 5.019s (kernel) + 24.044s (userspace) = 29.064s

4.6.7
Startup finished in 5.063s (kernel) + 24.008s (userspace) = 29.072s

4.7.0rc1
Startup finished in 6.260s (kernel) + 24.199s (userspace) = 30.460s
--------------------^^^^^^
--------------------we see the first small jump here in 4.7rc1

4.7.0rc7
Startup finished in 6.800s (kernel) + 22.748s (userspace) = 29.548s

4.7.10
Startup finished in 5.718s (kernel) + 23.185s (userspace) = 28.904s
Startup finished in 6.632s (kernel) + 22.865s (userspace) = 29.498s

4.8.0rc1
Startup finished in 17.619s (kernel) + 23.272s (userspace) = 40.892s
Startup finished in 17.364s (kernel) + 29.914s (userspace) = 47.278s
--------------------^^^^^^
--------------------first large jump in 4.8rc1

4.8.0
Startup finished in 16.631s (kernel) + 23.861s (userspace) = 40.493s

Dimitrij Mijoski (dimztimz) wrote :

For completeness, the latest mainline.

4.10.0rc6
Startup finished in 15.387s (kernel) + 25.344s (userspace) = 40.731s

We can definitely conclude that something happened between 4.7.10 and 4.8.0rc1. A detailed bisection should happen there.
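A sketch of what that bisection could look like, assuming a mainline git checkout in ./linux. Since 4.7.10 is a stable release branched from v4.7, the corresponding mainline window is v4.7..v4.8-rc1:

```shell
# Sketch only: start a bisection over the window identified above.
# Assumes a mainline kernel checkout in ./linux; the good/bad tags come
# from the timings in this thread (v4.7 boots fast, v4.8-rc1 does not).
start_bisect() {
    if [ -d linux/.git ]; then
        git -C linux bisect start v4.8-rc1 v4.7
        # For each step: build, install, reboot, measure with
        # `systemd-analyze time`, then mark the commit:
        #   git -C linux bisect good   # kernel time still ~5-6s
        #   git -C linux bisect bad    # kernel time jumped to ~15s
    else
        echo "no kernel checkout at ./linux"
        return 1
    fi
}

start_bisect || true
```

Each step needs a full build, install, and reboot, so the per-step boot measurement is the slow part, not git.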

Tomas Janousek (lis) wrote :

So I did a little bisecting today on mainline x86_64 kernels with CONFIG_SLUB:

v4.4 Startup finished in 2.718s (kernel) + 11.831s (userspace) = 14.549s
v4.5 Startup finished in 2.707s (kernel) + 12.777s (userspace) = 15.484s
v4.6-rc4-1018-g5a5f0792e98b Startup finished in 2.970s (kernel) + 13.641s (userspace) = 16.611s
v4.6-3615-g1eccc6e1529e Startup finished in 3.005s (kernel) + 14.063s (userspace) = 17.069s
v4.6-5396-g07b75260ebc2 Startup finished in 3.044s (kernel) + 14.932s (userspace) = 17.977s
v4.6-5931-gf4c80d5a16eb Startup finished in 3.057s (kernel) + 13.429s (userspace) = 16.487s
v4.6-6084-g78975f23cba0 Startup finished in 3.050s (kernel) + 13.903s (userspace) = 16.954s
v4.6-5050-g18726ca8b34b Startup finished in 3.015s (kernel) + 13.510s (userspace) = 16.525s
v4.6-5055-g6052b7880a95 Startup finished in 3.029s (kernel) + 13.519s (userspace) = 16.549s
v4.6-5058-g76b342bdc71b Startup finished in 2.998s (kernel) + 14.179s (userspace) = 17.177s
v4.6-5060-g801faf0db894 Startup finished in 3.064s (kernel) + 14.667s (userspace) = 17.732s
v4.6-5061-g81ae6d03952c Startup finished in 3.057s (kernel) + 18.357s (userspace) = 21.414s
v4.6-5073-g9cc3387fa29f Startup finished in 3.073s (kernel) + 17.976s (userspace) = 21.049s
v4.6-5119-g682a3385e773 Startup finished in 3.089s (kernel) + 17.987s (userspace) = 21.077s
v4.6-6266-ga05a70db34ba Startup finished in 3.037s (kernel) + 18.550s (userspace) = 21.588s
v4.6-7135-g0eff4589c36e Startup finished in 3.008s (kernel) + 18.074s (userspace) = 21.082s
v4.7-rc5-1395-g877fa9a42ddc Startup finished in 3.179s (kernel) + 19.147s (userspace) = 22.327s
v4.8 Startup finished in 3.197s (kernel) + 20.836s (userspace) = 24.034s
v4.9.7 Startup finished in 3.674s (kernel) + 20.034s (userspace) = 23.708s
v4.9.7 + the two patches Startup finished in 3.518s (kernel) + 13.027s (userspace) = 16.546s

What we can observe from this:

The big bump came in https://github.com/torvalds/linux/commit/81ae6d03952c1bfb96e1a716809bd65e7cd14360 (or alternatively in https://github.com/torvalds/linux/commit/801faf0db8947e01877920e848a4d338dd7a99e7 for CONFIG_SLAB, if bugzilla.kernel.org is trustworthy). This is likely fixed by https://github.com/torvalds/linux/commit/86d9f48534e800e4d62cdc1b5aaf539f4c1d47d6, https://github.com/torvalds/linux/commit/13583c3d3224508582ec03d881d0b68dd3ee8e10 and https://github.com/torvalds/linux/commit/89e364db71fb5e7fc8d93228152abfa67daf35fa. The last two are not in stable, though.

Speaking of https://github.com/torvalds/linux/commit/86d9f48534e800e4d62cdc1b5aaf539f4c1d47d6, this went into v4.8.7 as https://github.com/torvalds/linux/commit/68adb469dd3b28513c5b548df3437f2bc1a83e43, so that may be the fix you were looking for in comment #65.

There are also lots of small performance regressions here and there that may be difficult to bisect: a second of userspace time between v4.4 and v4.5 and another second between v4.5 and v4.6; some 300 ms of kernel time between v4.5 and v4.6, 200 ms between v4.6 and v4.7, and another 300 ms between v4.8 and v4.9. That's here, ...


Dimitrij Mijoski (dimztimz) wrote :

I booted into 4.8 again and looked at dmesg.

This is the error:
[ 3.375735] ...
[ 12.768084] [drm:drm_atomic_helper_commit_cleanup_done [drm_kms_helper]] *ERROR* [CRTC:26:pipe A] flip_done timed out

Apparently it is a known problem (https://bbs.archlinux.org/viewtopic.php?id=218581) with the Intel video driver in 4.8.
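The ~9-second jump between those two timestamps is easy to spot mechanically. A sketch of a gap finder (`find_gaps` is a hypothetical name; feed it real `dmesg` output on an affected boot):

```shell
# find_gaps (hypothetical helper): report gaps longer than 5 seconds
# between consecutive dmesg timestamps, which is how the flip_done
# stall above shows up.
find_gaps() {
    awk '{
        o = index($0, "["); c = index($0, "]")
        if (o != 1 || c < 3) next
        t = substr($0, o + 1, c - o - 1) + 0   # "[   12.768084]" -> 12.768084
        if (prev != "" && t - prev > 5)
            printf "gap of %.1fs before:%s\n", t - prev, substr($0, c + 1)
        prev = t
    }'
}

# Sample timestamps from the dmesg excerpt above.
printf '%s\n' \
    '[    3.375735] earlier message' \
    '[   12.768084] later message' | find_gaps
```

On a live system this would be `dmesg | find_gaps`; the 5-second threshold is arbitrary and can be lowered to catch smaller stalls.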

Colin Ian King (colin-king) wrote :

The fix for bug 1649905 should have helped improve this issue.

Plurtu (plurtu) wrote :

Still getting slow boots and hangs with the latest 4.8.0-37.39.
