Machine lockup in btrfs-transaction

Bug #1349711 reported by Peter Waller on 2014-07-29
20
This bug affects 3 people
Affects Status Importance Assigned to Milestone
Linux
Unknown
Unknown
linux (Ubuntu)
High
Unassigned

Bug Description

This has happened twice now.

I'm on an AWS EC2 m3.large instance with the official Ubuntu AMI ami-776d9700.

# cat /proc/version_signature
Ubuntu 3.13.0-32.57-generic 3.13.11.4

After running for many days, the machine locked up with the below messages appearing on the console. The machine would respond to ping but not SSH or HTTP requests. The machine has one BTRFS volume which is 87% full and lives on an Logical Volume Manager (LVM) block device on top of one Amazon Elastic Block Store (EBS) device.

Error messages after first reboot:

[ 77.609490] BTRFS error (device dm-0): block group 10766778368 has wrong amount of free space
[ 77.613678] BTRFS error (device dm-0): failed to load free space cache for block group 10766778368
[ 77.643801] BTRFS error (device dm-0): block group 19356712960 has wrong amount of free space
[ 77.648952] BTRFS error (device dm-0): failed to load free space cache for block group 19356712960
[ 77.926325] BTRFS error (device dm-0): block group 20430454784 has wrong amount of free space
[ 77.931078] BTRFS error (device dm-0): failed to load free space cache for block group 20430454784
[ 78.111437] BTRFS error (device dm-0): block group 21504196608 has wrong amount of free space
[ 78.116165] BTRFS error (device dm-0): failed to load free space cache for block group 21504196608

Error messages after second reboot:

[ 45.390221] BTRFS error (device dm-0): free space inode generation (0) did not match free space cache generation (70012)
[ 45.413472] BTRFS error (device dm-0): free space inode generation (0) did not match free space cache generation (70012)
[ 467.423961] BTRFS error (device dm-0): block group 518646661120 has wrong amount of free space
[ 467.429251] BTRFS error (device dm-0): failed to load free space cache for block group 518646661120

Error messages on the console after second lock-up follow:

[246736.752053] INFO: rcu_sched self-detected stall on CPU { 0} (t=2220246 jiffies g=35399662 c=35399661 q=0)
[246736.756059] INFO: rcu_sched detected stalls on CPUs/tasks: { 0} (detected by 1, t=2220247 jiffies, g=35399662, c=35399661, q=0)
[246764.192014] BUG: soft lockup - CPU#0 stuck for 23s! [kworker/u30:2:1828]
[246764.212058] BUG: soft lockup - CPU#1 stuck for 23s! [btrfs-transacti:492]
[246792.192022] BUG: soft lockup - CPU#0 stuck for 23s! [kworker/u30:2:1828]
[246792.212057] BUG: soft lockup - CPU#1 stuck for 23s! [btrfs-transacti:492]
[246820.192052] BUG: soft lockup - CPU#0 stuck for 23s! [kworker/u30:2:1828]
[246820.212018] BUG: soft lockup - CPU#1 stuck for 23s! [btrfs-transacti:492]
[246848.192052] BUG: soft lockup - CPU#0 stuck for 23s! [kworker/u30:2:1828]
[246848.212058] BUG: soft lockup - CPU#1 stuck for 23s! [btrfs-transacti:492]
[246876.192053] BUG: soft lockup - CPU#0 stuck for 22s! [kworker/u30:2:1828]
[246876.212057] BUG: soft lockup - CPU#1 stuck for 22s! [btrfs-transacti:492]
[246904.192053] BUG: soft lockup - CPU#0 stuck for 22s! [kworker/u30:2:1828]
[246904.212058] BUG: soft lockup - CPU#1 stuck for 22s! [btrfs-transacti:492]
[246916.772052] INFO: rcu_sched self-detected stall on CPU[246916.776058] INFO: rcu_sched detected stalls on CPUs/tasks:
[246944.192053] BUG: soft lockup - CPU#0 stuck for 22s! [kworker/u30:2:1828]
[246944.212058] BUG: soft lockup - CPU#1 stuck for 22s! [btrfs-transacti:492]
[246972.192053] BUG: soft lockup - CPU#0 stuck for 22s! [kworker/u30:2:1828]
[246972.212018] BUG: soft lockup - CPU#1 stuck for 22s! [btrfs-transacti:492]
[247000.192053] BUG: soft lockup - CPU#0 stuck for 22s! [kworker/u30:2:1828]
[247000.212058] BUG: soft lockup - CPU#1 stuck for 22s! [btrfs-transacti:492]
[247028.192054] BUG: soft lockup - CPU#0 stuck for 22s! [kworker/u30:2:1828]
[247028.212058] BUG: soft lockup - CPU#1 stuck for 22s! [btrfs-transacti:492]
[247056.192053] BUG: soft lockup - CPU#0 stuck for 23s! [kworker/u30:2:1828]
[247056.212061] BUG: soft lockup - CPU#1 stuck for 23s! [btrfs-transacti:492]

tags: added: soft-lockup
tags: added: btrfs-transaction
description: updated

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1349711

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Peter Waller (peter.waller) wrote :

I've also started a thread on linux-btrfs:

http://thread.gmane.org/gmane.comp.file-systems.btrfs/37224

Peter Waller (peter.waller) wrote :

@brad-figg, apologies I missed your response. Is there a way to generate the output without automatically uploading it? I would like to review it first. I tried `apport-cli --save` but that doesn't do anything unless there are any crash files that I can tell.

Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v3.16 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

If you are unable to test the mainline kernel, for example it will not boot, please add the tag: 'kernel-unable-to-test-upstream'.
Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.16-rc7-utopic/

Changed in linux (Ubuntu):
importance: Undecided → High
tags: added: kernel-da-key
Peter Waller (peter.waller) wrote :

The crashes became more frequent. The approximate time was >12 days running, then ~2 days running, then 6 hours, then 1 hour.

I since moved to 3.15.7-031507-generic.

One thing I have observed is that (EXT4 filesystem) /var/log/nginx/access.log contained ~2KB of NULL characters in place of any entries at the point when the machine froze.

Peter Waller (peter.waller) wrote :

One thing I am unsure of is that the bug did not manifest for at least 12 days running originally. So I'm not sure it is going to be possible to reliably decide that it is fixed by moving to a particular kernel. What is the standard here?

Peter Waller (peter.waller) wrote :

I found an additional stack trace from a previous machine lockup.

[1093202.136107] INFO: task kworker/u30:1:31455 blocked for more than 120 seconds.
[1093202.141596] Tainted: GF 3.13.0-30-generic #54-Ubuntu
[1093202.146201] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1093202.152408] kworker/u30:1 D ffff8801efc34440 0 31455 2 0x00000000
[1093202.152416] Workqueue: writeback bdi_writeback_workfn (flush-btrfs-1)
[1093202.152419] ffff880040d3b768 0000000000000002 ffff8800880f0000 ffff880040d3bfd8
[1093202.152422] 0000000000014440 0000000000014440 ffff8800880f0000 ffff8801efc34cd8
[1093202.152426] ffff8801effcefe8 ffff880040d3b7f0 0000000000000002 ffffffff8114e010
[1093202.152429] Call Trace:
[1093202.152435] [<ffffffff8114e010>] ? wait_on_page_read+0x60/0x60
[1093202.152439] [<ffffffff8171ea6d>] io_schedule+0x9d/0x140
[1093202.152442] [<ffffffff8114e01e>] sleep_on_page+0xe/0x20
[1093202.152445] [<ffffffff8171eff8>] __wait_on_bit_lock+0x48/0xb0
[1093202.152449] [<ffffffff8109df64>] ? vtime_common_task_switch+0x24/0x40
[1093202.152452] [<ffffffff8114e12a>] __lock_page+0x6a/0x70
[1093202.152456] [<ffffffff810aaee0>] ? autoremove_wake_function+0x40/0x40
[1093202.152474] [<ffffffffa00e8a0d>] lock_delalloc_pages+0x13d/0x1d0 [btrfs]
[1093202.152487] [<ffffffffa00eaf2b>] find_lock_delalloc_range.constprop.43+0x14b/0x1f0 [btrfs]
[1093202.152499] [<ffffffffa00ebfa1>] __extent_writepage+0x131/0x750 [btrfs]
[1093202.152509] [<ffffffffa00eb040>] ? end_extent_writepage+0x70/0x70 [btrfs]
[1093202.152521] [<ffffffffa00ec832>] extent_write_cache_pages.isra.30.constprop.50+0x272/0x3d0 [btrfs]
[1093202.152532] [<ffffffffa00edafd>] extent_writepages+0x4d/0x70 [btrfs]
[1093202.152544] [<ffffffffa00d3b20>] ? btrfs_real_readdir+0x5b0/0x5b0 [btrfs]
[1093202.152555] [<ffffffffa00d1cb8>] btrfs_writepages+0x28/0x30 [btrfs]
[1093202.152559] [<ffffffff8115a46e>] do_writepages+0x1e/0x40
[1093202.152562] [<ffffffff811e5c50>] __writeback_single_inode+0x40/0x210
[1093202.152565] [<ffffffff811e6a07>] writeback_sb_inodes+0x247/0x3e0
[1093202.152568] [<ffffffff811e6c3f>] __writeback_inodes_wb+0x9f/0xd0
[1093202.152571] [<ffffffff811e6eb3>] wb_writeback+0x243/0x2c0
[1093202.152574] [<ffffffff811e86d8>] bdi_writeback_workfn+0x108/0x430
[1093202.152577] [<ffffffff810974a8>] ? finish_task_switch+0x128/0x170
[1093202.152581] [<ffffffff810838a2>] process_one_work+0x182/0x450
[1093202.152585] [<ffffffff81084641>] worker_thread+0x121/0x410
[1093202.152588] [<ffffffff81084520>] ? rescuer_thread+0x3e0/0x3e0
[1093202.152591] [<ffffffff8108b322>] kthread+0xd2/0xf0
[1093202.152594] [<ffffffff8108b250>] ? kthread_create_on_node+0x1d0/0x1d0
[1093202.152598] [<ffffffff8172ac3c>] ret_from_fork+0x7c/0xb0
[1093202.152601] [<ffffffff8108b250>] ? kthread_create_on_node+0x1d0/0x1d0

Peter Waller (peter.waller) wrote :
Download full text (5.3 KiB)

I've got a way to rapidly reproduce the error now. I can do it reliably with a turnaround time of 5-10 minutes.

I've reproduced the crash on the new Kernel, so it has now been observed on both 3.13.0-32-generic and 3.15.7-031507-generic. I'll try 3.16 next.

I've also discovered this new stack trace at the point of the crash (setup_cluster_bitmap) which happens every time:

[97056.916053] NMI backtrace for cpu 1
[97056.916053] CPU: 1 PID: 1107 Comm: kworker/u30:1 Not tainted 3.13.0-32-generic #57-Ubuntu
[97056.916053] Hardware name: Xen HVM domU, BIOS 4.2.amazon 05/23/2014
[97056.916053] Workqueue: writeback bdi_writeback_workfn (flush-btrfs-1)
[97056.916053] task: ffff8800e8018000 ti: ffff8800da1ac000 task.ti: ffff8800da1ac000
[97056.916053] RIP: 0010:[<ffffffff81044a8c>] [<ffffffff81044a8c>] default_send_IPI_mask_sequence_phys+0xbc/0x100
[97056.916053] RSP: 0018:ffff8801efc23c60 EFLAGS: 00000046
[97056.916053] RAX: 0000000000000400 RBX: 000000000000b06a RCX: 0000000000000001
[97056.916053] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 0000000000000300
[97056.916053] RBP: ffff8801efc23c98 R08: ffffffff81d13780 R09: 0000000000000d2e
[97056.916053] R10: 0000000000000000 R11: ffff8801efc239e6 R12: ffffffff81d13780
[97056.916053] R13: 0000000000000400 R14: 0000000000000086 R15: 0000000000000002
[97056.916053] FS: 0000000000000000(0000) GS:ffff8801efc20000(0000) knlGS:0000000000000000
[97056.916053] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[97056.916053] CR2: 00007f803f12e000 CR3: 0000000101cc2000 CR4: 00000000001406e0
[97056.916053] Stack:
[97056.916053] 0000000100000008 0000000000000001 0000000000002710 ffffffff81c4e1c0
[97056.916053] ffffffff81d137a0 ffffffff81c4e1c0 0000000000000001 ffff8801efc23ca8
[97056.916053] ffffffff81049217 ffff8801efc23cc0 ffffffff81044c60 ffff8801efc2e800
[97056.916053] Call Trace:
[97056.916053] <IRQ>
[97056.916053] [<ffffffff81049217>] physflat_send_IPI_all+0x17/0x20
[97056.916053] [<ffffffff81044c60>] arch_trigger_all_cpu_backtrace+0x70/0xb0
[97056.916053] [<ffffffff810ca9ee>] rcu_check_callbacks+0x3fe/0x650
[97056.916053] [<ffffffff81076227>] update_process_times+0x47/0x70
[97056.916053] [<ffffffff810d5cf5>] tick_sched_handle.isra.17+0x25/0x60
[97056.916053] [<ffffffff810d5d71>] tick_sched_timer+0x41/0x60
[97056.916053] [<ffffffff8108e5e7>] __run_hrtimer+0x77/0x1d0
[97056.916053] [<ffffffff810d5d30>] ? tick_sched_handle.isra.17+0x60/0x60
[97056.916053] [<ffffffff8108edaf>] hrtimer_interrupt+0xef/0x230
[97056.916053] [<ffffffff81009fef>] xen_timer_interrupt+0x2f/0x150
[97056.916053] [<ffffffff81009ef0>] ? xen_clocksource_read+0x20/0x30
[97056.916053] [<ffffffff8101b7e9>] ? sched_clock+0x9/0x10
[97056.916053] [<ffffffff8109d1ad>] ? sched_clock_local+0x1d/0x80
[97056.916053] [<ffffffff810bf78e>] handle_irq_event_percpu+0x3e/0x1d0
[97056.916053] [<ffffffff810c2bbe>] handle_percpu_irq+0x3e/0x60
[97056.916053] [<ffffffff8142dba7>] __xen_evtchn_do_upcall+0x317/0x320
[97056.916053] [<ffffffff8109d98e>] ? __vtime_account_system+0x2e/0x40
[97056.916053] [<ffffffff8109dd2c>] ? vtime_account_system+0x3c/0x50
[97056.916053] [<ffffffff8142fc8b>] xen_evtchn_do_upcall+0x2b/0x50
[97056.916053] [<ffffffff8172e22d>] xe...

Read more...

Peter Waller (peter.waller) wrote :

Now reproduced on 3.16. I'm out of things to try for now.

Peter Waller (peter.waller) wrote :

This gist contains a stack trace every 10 seconds taken with `echo l > /proc/sysrq-trigger` whilst the machine was spinning in the kernel but still responsive.

https://gist.github.com/pwaller/c7dd0f4807459acedcdf

The machine remained responsive for 5-10 minutes before becoming totally unresponsive to network activity.

The test case just involves copying every sqlite file in a directory tree containing ~400GB of files of various sizes to new files in the same directory with `.new` on the end of the filename.

tags: added: kernel-bug-exists-upstream
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Chris J Arges (arges) wrote :

From IRC, an example of a testcase:

1) create around ~300GB of small files (sqlite files for example), put the files into a list sqlite-files.txt
2) Start the copy:
cat sqlite-files.txt | xargs -n1 -I{} -P2 sudo sh -c 'rm -f {}.new; cat {} > {}.new; echo {}'
3) When it hangs, identify where it hung as $NUM and resume with the following:
tail -n+$NUM | xargs -n1 -I{} -P2 sudo sh -c 'rm -f {}.new; cat {} > {}.new; echo {}'

Chris J Arges (arges) wrote :

Peter also tested on a single core machine and was able to reproduce the issue.

Chris J Arges (arges) wrote :

So at this point it would be good to have a clear idea of how to get into this state to better understand the issue.
Could you write a detailed description of how you setup your machine/volume, then others can try to run a similar test here and see if we can reproduce the same results?
Thanks,

Peter Waller (peter.waller) wrote :

Hm, I'm not sure I can give a thorough description since I don't understand enough about the exact workload myself. It is a fairly arbitrary workload generated by our users.

In the end, it boils down to creating, reading and writing many (~20,000) sqlite files of size 16kb - 12GB across many folders and doing random read/write IO to them. The directory structure is that all 20,000 files live inside a directory of one root directory, like so:

/path/00001/file.sqlite
/path/00002/file.sqlite

etc.

The 500GB volume has approximately 275GB of such files. When copying /path/00001/file.sqlite to /path/00001/file.sqlite.new (and so on) with `cat` twice in parallel (via xargs -P2 as in #11), the volume eventually (after multiple hours) hangs. If the copying is resumed from the last file successfully copied before the hang, the hang onset is very rapid.

Peter Waller (peter.waller) wrote :

(otherwise unloaded test machines)

On a dual core machine, 100% system CPU usage with zero writes is seen on one core for 5-10 minutes, spending time in BTRFS threads.

On a single thread machine 100% system CPU is used and I haven't yet been able to cause it to hang entirely. I do observe almost completely 100% system CPU usage and very low IO rates, 0 IOPS for up to minutes at a time before it then goes to >1000 IOPs.

The system goes to IOWAIT when there is significant IO traffic, and otherwise kernel threads are consuming CPU.

However the latter is on a non-prewarmed EBS. So I am not sure if the behaviour we're seeing is entirely due to cold block storage where we have no latency/rate guarantees.

Generally, I can't tell if we have hit a bug in the BTRFS free extent code, or if it is all due to bad EBS performance and a spinlock somewhere.

Stefan Bader (smb) wrote :

It is hard to say for sure. I would not know the details of EC2 guest setup. The block device could be local but more likely is some form of network attachment (iscsi, ndb). And btrfs is still a newer kid on the block. So there still might be surprises just because of that.

You description was about what we wanted. Something that might be used to come up with a reproducer which in turn could be run on bare-metal or a different hypervisor to prove or rule out a tie with EC2/Xen. One detail I did not see was whether the btrfs was created with special options (what mkfs.btrfs line was used). So we can get the as close as possible (minus the cloudy details).

For your production systems. Is using btrfs a hard requirement or could those at least work around by getting moved to ext4 or xfs? At least until the btrfs issue is understood. Which could take a bit (at least it is like reading a crystal ball to make any estimate).

Peter Waller (peter.waller) wrote :

btrfs was created with `mkfs.btrfs /dev/mapper/vg-lv`.

It isn't a hard requirement except that it's a pain to migrate since that requires downtime to move the files. Something I'd rather not do unless absolutely necessary. The machine freezes are inconvenient but represent a few minutes downtime if the machine is rebooted on hangup.

Stefan Bader (smb) wrote :

Hm, odd. To get a similar output to "btrfs fi df <mount>", I have to add -m dup (otherwise I have no DUP lines for system and metadata). And especially the metadata-dup was relatively high in use 4.65GiB of 5.50GiB on the output you showed us yesterday.

Peter Waller (peter.waller) wrote :

The filesystem may have been originally created on an older version of BTRFS from Ubuntu Saucy, which I suppose may not have detected the SSD?

Stefan Bader (smb) wrote :

Maybe different default options. I don't think a PV guest will know the difference of the backing real device. Would those not all be just blkfront pv block devices? Probably another detail we might be interested in here. If the device name for the pv was xvd* then its a pv disk (not using ec2 that much to know all things that might be possible).
And I am not saying the metadata+DUP thing is bad. But maxybe something to ask those people that have a better clue on btrfs. ;)

Peter Waller (peter.waller) wrote :

smb: Yeah, the system the filesystem was created on was PV, the device name was xvd*. Now it's on HVM with xvd*.

Peter Waller (peter.waller) wrote :
Download full text (4.5 KiB)

The production machine hasn't had a lockup since moving to 3.15.7-031507-generic (it's been up for 4 days) even though we could reproduce the lockup on a new machine with that kernel using a snapshot of the old volume.

Another twist is that on the productino machine I'm now reliably seeing "No space left on device", even though there appears to be in principle 63GB remaining:

$ btrfs fi df /path/to/volume
Data, single: total=489.97GiB, used=427.75GiB
System, DUP: total=8.00MiB, used=60.00KiB
System, single: total=4.00MiB, used=0.00
Metadata, DUP: total=5.00GiB, used=4.50GiB
Metadata, single: total=8.00MiB, used=0.00
unknown, single: total=512.00MiB, used=0.00

$ sudo btrfs fi show /path/to/volume
Label: none uuid: 3ffd71ab-6c3d-4486-a6b0-5c1eeb9be6b3
 Total devices 1 FS bytes used 432.25GiB
 devid 1 size 500.00GiB used 500.00GiB path /dev/dm-0

The ENOSPC is happening for mkdir and rename syscalls in particular.

I've posted a mail to the BTRFS list about this: http://thread.gmane.org/gmane.comp.file-systems.btrfs/37415

I did a rebalance with `btrfs balance start -dusage=10` (increasing 10) to try and gain more space for metadata, but this didn't fix the situation. I did however get this stack trace in dmesg.

In the end, I had to enlarge the volume before it became usable again.

[375794.106653] ------------[ cut here ]------------
[375794.106676] WARNING: CPU: 1 PID: 24706 at /home/apw/COD/linux/fs/btrfs/extent-tree.c:6946 use_block_rsv+0xfd/0x1a0 [btrfs]()
[375794.106678] BTRFS: block rsv returned -28
[375794.106679] Modules linked in: softdog tcp_diag inet_diag dm_crypt ppdev xen_fbfront fb_sys_fops syscopyarea sysfillrect sysimgblt i2c_piix4 serio_raw parport_pc parport mac_hid isofs xt_tcpudp iptable_filter xt_owner ip_tables x_tables btrfs xor raid6_pq crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd floppy psmouse
[375794.106702] CPU: 1 PID: 24706 Comm: twsearch.py Not tainted 3.15.7-031507-generic #201407281235
[375794.106703] Hardware name: Xen HVM domU, BIOS 4.2.amazon 05/23/2014
[375794.106705] 0000000000001b22 ffff88016db437c8 ffffffff8176f115 0000000000000007
[375794.106707] ffff88016db43818 ffff88016db43808 ffffffff8106ceac ffff8801e4890000
[375794.106709] ffff8800a71ab9c0 ffff8801aedcd800 0000000000001000 ffff88001c987000
[375794.106711] Call Trace:
[375794.106718] [<ffffffff8176f115>] dump_stack+0x46/0x58
[375794.106721] [<ffffffff8106ceac>] warn_slowpath_common+0x8c/0xc0
[375794.106723] [<ffffffff8106cf96>] warn_slowpath_fmt+0x46/0x50
[375794.106731] [<ffffffffa00d9d1d>] use_block_rsv+0xfd/0x1a0 [btrfs]
[375794.106739] [<ffffffffa00de687>] btrfs_alloc_free_block+0x57/0x220 [btrfs]
[375794.106746] [<ffffffffa00c8a3c>] btrfs_copy_root+0xfc/0x2b0 [btrfs]
[375794.106757] [<ffffffffa013a583>] ? create_reloc_root+0x33/0x2c0 [btrfs]
[375794.106767] [<ffffffffa013a743>] create_reloc_root+0x1f3/0x2c0 [btrfs]
[375794.106776] [<ffffffffa0140eb8>] btrfs_init_reloc_root+0xb8/0xd0 [btrfs]
[375794.106784] [<ffffffffa00ee967>] record_root_in_trans.part.30+0x97/0x100 [btrfs]
[375794.106792] [<ffffffffa00ee9f4>] record_root_in_trans+0x24/0x30 [btrfs]
[37...

Read more...

Peter Waller (peter.waller) wrote :

Repost of what I sent to the mailing list just now:

My current interpretation of this problem is that it is some
pathological condition caused by not rebalancing and being nearly out
of space for allocating more metadata and hence it is rarely being
seen by anyone else (because most users are regularly doing
rebalances).

See this thread for details about rebalancing and out of space:
"ENOSPC with mkdir and rename" on 2014-08-02:

http://thread.gmane.org/gmane.comp.file-systems.btrfs/37415

I haven't had the lockups in production since July and I'm now trialling a
nightly rebalance:

$ btrfs filesystem balance start -dusage=50 -musage=10 "$mount"

I'll report back if I encounter further problems.

Stefan Bader (smb) wrote :

So the way I read the thread, there is the (apparently better known for developers than it is currently documented) basic problem with btrfs that it can run out of space rather unexpectedly. I was a bit surprised as well to read that 500MB (while looking like a whole lot of space coming from using other filesystems) can go away quickly as there are certain big chunks by which allocation is done. And all the information seems to be spread into output of various commands.

What I take from that for the bug report here is that because of getting close to the no space situation the fs is in a state where it needs to do a lot of bookkeeping. The problem with the softlockup warnings is that they can just report something that is not a complete lockup. So basically doing fs operations in the foreground cause data to be written out in the background. If the organizational layout of the fs becomes complicated it takes more processing time to find out where to actually put the data (also the meta-data has to be updated in parallel). At some point (if the fs gets exercised more) the backlog grows and most of the free memory is used up by buffers for the fs. So that explains the very laggy responsiveness of the whole system. And of course at some point there will be a sync point of the fs and that might be waiting on something that is not compled for quite a long time. So maybe with a lot of patience the system (after interrupting the updates) would recover and become responsive again.

Bottom line would be that this is not a lockup but a fallout from over-fullness combined with bad handling (or maybe bad just because things get complicated and so more difficult) of the io. So not much that can be done at that point but has to be prevented in advance. Unfortnunately manually,

Peter Waller (peter.waller) wrote :

That's my understanding too, except in one of the scenarios I observed 100% SYS CPU for long stretches even when there was a significant amount (~50GB) of the device unused.

However, if it was a soft lockup it was for >8 hours, during which the machine was totally unresponsive to HTTP requests, which amounts to a very dead production machine.

In the end I'm not sure what to do about all of this. I'm surprised to find out that BTRFS can back itself into a corner from which it cannot retrieve itself.

Personally for me, 3.18.x kernel was the first one where btrfs finally behaves stable (so far).

Christian Reis (kiko) wrote :

I'll agree with Tomasz -- running the 3.18.1 kernel I've had much better experience with btrfs than with Trusty's stock 3.13.

Peter, would you consider re-formatting your disk with recent btrfs-tools, and using a 3.18+ Linux kernel? I think it's fair to assume that or try if the problem would be resolved. And that would help you move forward as well.

Consider how unstable older versions are warned to be:

"WARNING! - Btrfs v3.14.1 IS EXPERIMENTAL
WARNING! - see http://btrfs.wiki.kernel.org before using"

Dario Bertini (berdario) wrote :

With btrfs-tools 4.0-2, and kernel 4.2.0-34-generic, I got a kthread pegging up 1 core at 100%, and the whole system was almost completely unusable.

this persisted for several minutes. Then it quieted down, and then the problem reappeared... even after rebooting.

I had 2GB of free space. I now freed up a bit more, and I have 30GB... let's see if it'll happen again.

While it was happening, I got some perf dump with:

sudo perf record -g --proc-map-timeout 100000 -a sleep 10

and I found that most of the time was spent in native_queued_spin_lock_slowpath (name not unexpected, since it's a spin lock), I'll attach the perf.data now (I have 3 of them)

Dario Bertini (berdario) wrote :

I got "the connection used to fetch this resource is insecure" and I never receive a response when sending the POST request to upload the attachment on launchpad.net... I'll try again later

Neal McBurnett (nealmcb) wrote :

The two upstream report links above via gmane now say "ArchivedAt Nothing found - bye"
http://web.archive.org has no record of them.

Are updated links available? Did any discussion take place there?

I'm seeing INFO: task btrfs-transacti:979 blocked for more than 120 seconds. and wondering if it is related.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.