Reboot into linux 4.11.0-13 after update caused 100% cpu on btrfs-cleaner/btrfs-transaction

Bug #1710653 reported by Tiago Stürmer Daitx
This bug affects 1 person
Affects: linux (Ubuntu)
Status: Expired
Importance: Medium
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

After a while without updating, I did a big update following the archive transitions.

On the first boot afterwards, btrfs-cleaner got stuck at 100% CPU, sometimes trading places with btrfs-transaction. atop reports no significant I/O, only high system CPU time. The computer becomes very slow while this is happening.

PRC | sys 10.13s | user 0.56s | #proc 249 | #zombie 2 | #exit 1 |
CPU | sys 101% | user 5% | irq 0% | idle 293% | wait 0% |
cpu | sys 83% | user 0% | irq 0% | idle 17% | cpu000 w 0% |
cpu | sys 18% | user 2% | irq 0% | idle 81% | cpu001 w 0% |
cpu | sys 0% | user 2% | irq 0% | idle 98% | cpu002 w 0% |
cpu | sys 0% | user 2% | irq 0% | idle 98% | cpu003 w 0% |
CPL | avg1 4.23 | avg5 4.94 | avg15 3.90 | csw 14872 | intr 10422 |
MEM | tot 19.1G | free 13.8G | cache 3.0G | buff 15.4M | slab 549.2M |
SWP | tot 6.0G | free 6.0G | | vmcom 4.6G | vmlim 15.5G |
LVM | root | busy 0% | read 1 | write 0 | avio 0.00 ms |
DSK | sda | busy 0% | read 1 | write 0 | avio 0.00 ms |
NET | transport | tcpi 8 | tcpo 5 | udpi 7 | udpo 6 |
NET | network | ipi 15 | ipo 11 | ipfrw 0 | deliv 15 |
NET | wlan0 0% | pcki 14 | pcko 10 | si 1 Kbps | so 1 Kbps |
NET | lo ---- | pcki 2 | pcko 2 | si 0 Kbps | so 0 Kbps |

   PID SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/1
   423 8.95s 0.00s 0K 0K 16K 384K -- - R 0 90% btrfs-transact
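
To make the atop snapshot above easier to read, here is a minimal Python sketch that parses the four per-core "cpu" lines exactly as quoted and reports which core is busiest in system time. The function name parse_core_line is hypothetical, and the parser only assumes the column layout visible in this report; other atop versions may format these lines differently.

```python
# Per-core lines copied verbatim from the atop snapshot in this report.
ATOP_LINES = """\
cpu | sys 83% | user 0% | irq 0% | idle 17% | cpu000 w 0% |
cpu | sys 18% | user 2% | irq 0% | idle 81% | cpu001 w 0% |
cpu | sys 0% | user 2% | irq 0% | idle 98% | cpu002 w 0% |
cpu | sys 0% | user 2% | irq 0% | idle 98% | cpu003 w 0% |
"""


def parse_core_line(line):
    """Return (core_name, {field: percent}) for one per-core atop line."""
    cells = [c.strip() for c in line.strip().strip("|").split("|")]
    # cells[0] is the "cpu" label; the last cell looks like "cpu000 w 0%".
    core_name, _, wait = cells[-1].split()
    fields = {"wait": int(wait.rstrip("%"))}
    for cell in cells[1:-1]:
        name, value = cell.split()
        fields[name] = int(value.rstrip("%"))
    return core_name, fields


cores = [parse_core_line(line) for line in ATOP_LINES.splitlines()]
busiest_name, busiest = max(cores, key=lambda c: c[1]["sys"])
total_sys = sum(fields["sys"] for _, fields in cores)

print(f"busiest core: {busiest_name} (sys {busiest['sys']}%, wait {busiest['wait']}%)")
print(f"total sys across cores: {total_sys}%")
```

The per-core sys values add up to the 101% shown on the summary "CPU" line, and the wait column is 0% on every core, which matches the observation that the load is CPU time in the kernel rather than disk I/O.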

Collected data with apport-cli, hopefully this is worth something. Will reboot and check how it goes.

ProblemType: Bug
DistroRelease: Ubuntu 17.10
Package: linux-image-4.11.0-13-generic 4.11.0-13.19
ProcVersionSignature: Ubuntu 4.11.0-13.19-generic 4.11.12
Uname: Linux 4.11.0-13-generic x86_64
AlsaVersion: Advanced Linux Sound Architecture Driver Version k4.11.0-13-generic.
ApportVersion: 2.20.6-0ubuntu5
Architecture: amd64
ArecordDevices:
 Home directory not accessible: Permission denied
 **** List of CAPTURE Hardware Devices ****
 card 1: PCH [HDA Intel PCH], device 0: CX20751/2 Analog [CX20751/2 Analog]
   Subdevices: 1/1
   Subdevice #0: subdevice #0
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: tdaitx 2812 F.... pulseaudio
 /dev/snd/controlC1: tdaitx 2812 F.... pulseaudio
Card0.Amixer.info:
 Card hw:0 'HDMI'/'HDA Intel HDMI at 0xf711c000 irq 49'
   Mixer name : 'Intel Broadwell HDMI'
   Components : 'HDA:80862808,80860101,00100000'
   Controls : 35
   Simple ctrls : 5
Card1.Amixer.info:
 Card hw:1 'PCH'/'HDA Intel PCH at 0xf7118000 irq 48'
   Mixer name : 'Conexant CX20751/2'
   Components : 'HDA:14f1510f,1043183d,00100100'
   Controls : 20
   Simple ctrls : 9
Date: Mon Aug 14 11:50:30 2017
HibernationDevice: #RESUME=/dev/mapper/asus_vg-swap
Lsusb:
 Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
 Bus 001 Device 004: ID 8087:0a2a Intel Corp.
 Bus 001 Device 003: ID 03eb:8b06 Atmel Corp.
 Bus 001 Device 002: ID 064e:9700 Suyin Corp. Asus Integrated Webcam
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
MachineType: ASUSTeK COMPUTER INC. UX303LAB
ProcFB: 0 inteldrmfb
ProcKernelCmdLine: root=/dev/mapper/root rootflags=subvol=@,compress ro i915.modeset=1 video=i915 quiet splash cryptopts=target=root,source=UUID=36b37f43-e457-4be2-becc-4a49ca1d5b71 fbcon=scrollback:8192k
PulseList:
 Error: command ['pacmd', 'list'] failed with exit code 1: Home directory not accessible: Permission denied
 No PulseAudio daemon running, or not running as session daemon.
RelatedPackageVersions:
 linux-restricted-modules-4.11.0-13-generic N/A
 linux-backports-modules-4.11.0-13-generic N/A
 linux-firmware 1.167
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 12/11/2014
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: UX303LAB.203
dmi.board.asset.tag: ATN12345678901234567
dmi.board.name: UX303LAB
dmi.board.vendor: ASUSTeK COMPUTER INC.
dmi.board.version: 1.0
dmi.chassis.asset.tag: No Asset Tag
dmi.chassis.type: 10
dmi.chassis.vendor: ASUSTeK COMPUTER INC.
dmi.chassis.version: 1.0
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvrUX303LAB.203:bd12/11/2014:svnASUSTeKCOMPUTERINC.:pnUX303LAB:pvr1.0:rvnASUSTeKCOMPUTERINC.:rnUX303LAB:rvr1.0:cvnASUSTeKCOMPUTERINC.:ct10:cvr1.0:
dmi.product.name: UX303LAB
dmi.product.version: 1.0
dmi.sys.vendor: ASUSTeK COMPUTER INC.

Tiago Stürmer Daitx (tdaitx) wrote :
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.13 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.13-rc4

Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Incomplete
Tiago Stürmer Daitx (tdaitx) wrote :

FYI, I had been running 4.10.0-26 for the past few days before the 400+ package upgrade in artful. I use snapper to take periodic system snapshots (pre/post apt install|remove, plus hourly/daily/weekly).

The actual behavior was that shortly after boot the btrfs cleaner/transaction process got permanently stuck at 100% on a single CPU. After that the system cycled between somewhat responsive and unresponsive every few minutes. The unresponsiveness varied:
- a terminal command was unresponsive and only finished later on
- no screen refresh in a workspace, but I was able to cycle between workspaces
- sometimes even X got unresponsive (not even a cursor)

tl;dr: I finally got it working by disabling btrfs quotas on the root subvolume. I re-enabled them afterward and have had no problems so far.

For the sake of keeping some history on what I did in between, here are the steps I tried:

1) Waited for a couple hours in case btrfs was stuck due to too many snapshots.
I am using snapper to get periodic system snapshots, and I had noticed that apt-get install/remove was getting slightly slower over the past few days under 4.10.0-26.

2) Waiting in step #1 didn't help, so I rebooted into 4.11.0-10, still the same issue.

3) I thought about rebooting into 4.10.0-26, but then I read a report that btrfs snapshot changes in the kernel had caused a slowdown on Netgear NAS devices, and that those changes made it impossible to downgrade.
The report did not say which kernel version was involved, and since I had to browse from my phone I could not check whether any such changes landed between 4.10 and 4.11, so I decided to avoid downgrading just in case.

4) Rebooted into 4.11.0-13 again, still the same issue - as expected.

5) I had around 40 snapshots between pre/post apt installs and hourly/daily/weekly snapshots from snapper, so I set a lower threshold and had snapper drop the oldest ones. Didn't help.

6) On the same Netgear page I saw some users reporting that disabling btrfs quotas helped (others said it didn't).
I tried that on the root subvolume and it helped. I have various subvolumes, but it worked on the first try for root. I then re-enabled quotas to check whether btrfs-cleaner/btrfs-transaction would misbehave again, but everything is still running smoothly. I haven't tried rebooting yet; I will report back if rebooting turns out to be an issue.
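
The report does not give the exact commands used in step 6. What follows is a minimal sketch of the disable/re-enable sequence, assuming the standard btrfs-progs subcommands `btrfs quota disable <path>` and `btrfs quota enable <path>`, and assuming the root subvolume is mounted at `/`. The helper names quota_command and run are hypothetical. By default it only prints the commands; actually running them requires root.

```python
import shlex
import subprocess


def quota_command(action, mountpoint="/"):
    """Build the btrfs-progs command that toggles quotas on a mounted filesystem."""
    if action not in ("enable", "disable"):
        raise ValueError(f"unsupported action: {action!r}")
    return ["btrfs", "quota", action, mountpoint]


def run(cmd, dry_run=True):
    """Print the command; execute it only when dry_run is False (needs root)."""
    print("+ " + shlex.join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)


# Sequence described in step 6: turn quotas off, then turn them back on
# once the system is responsive again. The mountpoint "/" is an assumption.
run(quota_command("disable"))
run(quota_command("enable"))
```

Keeping the default as a dry run is deliberate: disabling quotas discards the existing qgroup accounting, so it is worth reviewing the printed commands before passing dry_run=False.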

Tiago Stürmer Daitx (tdaitx) wrote : Re: [Bug 1710653] Re: Reboot into linux 4.11.0-13 after update caused 100% cpu on btrfs-cleaner/btrfs-transaction

On Mon, Aug 14, 2017 at 4:06 PM, Joseph Salisbury
<email address hidden> wrote:
> Would it be possible for you to test the latest upstream kernel? Refer
> to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest
> v4.13 kernel[0].
>
> If this bug is fixed in the mainline kernel, please add the following
> tag 'kernel-fixed-upstream'.
>
> If the mainline kernel does not fix this bug, please add the tag:
> 'kernel-bug-exists-upstream'.
>
> Once testing of the upstream kernel is complete, please mark this bug as
> "Confirmed".

As I stated in a previous post, I somewhat 'fixed' the issue by
disabling quotas on the root subvolume. I re-enabled them and just
rebooted, and there have been no further problems. If this ever happens
again I will try the latest upstream kernel.

Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired