[BTRFS] hard lockup on filserver

Bug #1237794 reported by Sean Clarke
This bug affects 1 person
Affects: linux (Ubuntu)
Status: Fix Released
Importance: High
Assigned to: Unassigned

Bug Description

Hi,
    My Core i7 18TB BTRFS file server was upgraded to 13.10 on Monday and has locked up frequently since. Previously it never crashed and was only taken down for maintenance; since the upgrade it locks up every 20 hrs or so. We were doing a large backup last night and it locked up within 6 hrs or so.

On the 1st occurrence absolutely nothing was in the logs; on the 2nd occurrence there were some NFS hang messages. When the system comes back up I will check and post any logging information gleaned.

Revision history for this message
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1237794

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Sean Clarke (sean-clarke) wrote : Re: hard lockup on filserver

Output from:

ubuntu-bug linux

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Sean Clarke (sean-clarke) wrote :

Last night's/this morning's lockup had nothing logged:

Oct 9 20:20:48 enterprise sm-notify[867]: Unable to notify starbug.sec-consulting.co.uk, giving up
Oct 9 20:20:48 enterprise sm-notify[867]: Unable to notify subversion.sec-consulting.co.uk, giving up
Oct 9 20:27:49 enterprise kernel: [ 1391.065867] perf samples too long (2559 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
Oct 9 21:17:01 enterprise CRON[2416]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Oct 9 22:17:01 enterprise CRON[2479]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Oct 9 23:00:12 enterprise kernel: [10530.153330] perf samples too long (5096 > 5000), lowering kernel.perf_event_max_sample_rate to 25000
Oct 9 23:17:01 enterprise CRON[2602]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Oct 10 00:17:01 enterprise CRON[2652]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Oct 10 01:17:01 enterprise CRON[2738]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Oct 10 10:03:24 enterprise kernel: imklog 5.8.11, log source = /proc/kmsg started.
Oct 10 10:03:24 enterprise rsyslogd: [origin software="rsyslogd" swVersion="5.8.11" x-pid="1119" x-info="http://www.rsyslog.com"] start
Oct 10 10:03:24 enterprise rsyslogd: rsyslogd's groupid changed to 103
Oct 10 10:03:24 enterprise rsyslogd: rsyslogd's userid changed to 101
Oct 10 10:03:24 enterprise rsyslogd-2039: Could not open output pipe '/dev/xconsole' [try http://www.rsyslog.com/e/2039 ]
Oct 10 10:03:24 enterprise kernel: [ 0.000000] Initializing cgroup subsys cpuset
Oct 10 10:03:24 enterprise kernel: [ 0.000000] Initializing cgroup subsys cpu
Oct 10 10:03:24 enterprise kernel: [ 0.000000] Initializing cgroup subsys cpuacct
Oct 10 10:03:24 enterprise kernel: [ 0.000000] Linux version 3.11.0-12-generic (buildd@allspice) (gcc version 4.8.1 (Ubuntu/Linaro 4.8.1-10ubuntu7) ) #18-Ubuntu SMP Tue Oct 8 20:51:28 UTC 2013 (Ubuntu 3.11.0-12.18-generic 3.11.3)
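
The only trace of the lockup in the excerpt above is the silence between the last hourly cron entry (01:17:01) and the reboot (10:03:24). A minimal sketch for spotting such gaps in a syslog file follows. The function name `find_gaps`, the two-hour threshold and the supplied year are assumptions for illustration; classic syslog timestamps carry no year.

```python
from datetime import datetime, timedelta


def find_gaps(lines, min_gap=timedelta(hours=2), year=2013):
    """Return (previous, next, gap) tuples for consecutive entries
    that are more than min_gap apart."""
    gaps = []
    prev = None
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue
        try:
            # Syslog lines start with "Mon D HH:MM:SS"; the year is assumed.
            stamp = datetime.strptime(
                f"{year} {parts[0]} {parts[1]} {parts[2]}", "%Y %b %d %H:%M:%S"
            )
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if prev is not None and stamp - prev > min_gap:
            gaps.append((prev, stamp, stamp - prev))
        prev = stamp
    return gaps


# Three lines taken from the log excerpt above.
sample = [
    "Oct 10 00:17:01 enterprise CRON[2652]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)",
    "Oct 10 01:17:01 enterprise CRON[2738]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)",
    "Oct 10 10:03:24 enterprise kernel: imklog 5.8.11, log source = /proc/kmsg started.",
]

for before, after, gap in find_gaps(sample):
    print(f"no entries between {before} and {after} ({gap})")
```

With hourly cron entries as a baseline, any gap well over an hour narrows down when the machine stopped responding, even though the kernel itself logged nothing.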

Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v3.12 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

If you are unable to test the mainline kernel, for example it will not boot, please add the tag: 'kernel-unable-to-test-upstream'.
Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.12-rc4-saucy/

tags: added: saucy
Changed in linux (Ubuntu):
importance: Undecided → High
tags: added: needs-bisect regression-release
Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Sean Clarke (sean-clarke) wrote :

Now running mainline kernel (3.12.0-999-generic #201310090426) and retesting.

Sean Clarke (sean-clarke) wrote :

OK, system has run perfectly since I installed the 3.12.0-999-generic #201310090426 kernel.

Changing status as instructed

Sean Clarke (sean-clarke) wrote :

Not status (sorry), adding tag.

tags: added: kernel-fixed-upstream
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Sean Clarke (sean-clarke) wrote :

OK, after a week of further testing I don't think the issue is resolved. I am moving about 150GB of data around and the system still hard-locks.

Nothing in the syslog; however, I had top running at the time of failure and the BTRFS processes went through the roof:

   0 0 0 R 100.0 0.0 0:51.50 [btrfs-transacti]
   0 0 0 R 72.1 0.0 0:11.68 [btrfs-flush_del]
   0 0 0 S 72.1 0.0 0:17.36 [btrfs-flush_del]
   0 0 0 R 57.8 0.0 0:12.34 [btrfs-flush_del]
   0 0 0 S 57.5 0.0 0:14.28 [btrfs-flush_del]

I will remove the tag.

tags: removed: kernel-fixed-upstream
Sean Clarke (sean-clarke) wrote :

Installed 3.12.0-999-generic #201310170405 and retrying.

Sean Clarke (sean-clarke) wrote :

OK, it is reproducible. I have a fileserver with 6x 3TB in a BTRFS RAID 1+0 configuration.

From a client (and using NFS) I copy a 95GB tar file from the fileserver to a USB HD.

It seems that at the very end (when BTRFS deletes the 95GB file on the server) it falls over, with btrfs-transaction and btrfs-flush_del using 3 to 5 cores at 50 to 100%.
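
For anyone trying to narrow this down without the NFS client, a rough local approximation of the "delete a large file" step might look like the sketch below. This is an assumption about what triggers the problem, not a confirmed reproducer: it only writes a file, removes it and times how long the filesystem takes to commit the deletion. The function name `time_large_delete` and the file name are hypothetical, and the demo uses a small file in a temporary directory; on the real server the directory would be on the BTRFS volume and the size much larger.

```python
import os
import tempfile
import time


def time_large_delete(directory, size_bytes, chunk=1 << 20):
    """Write a file of size_bytes into directory, then time remove + sync."""
    path = os.path.join(directory, "delete-test.bin")  # hypothetical name
    block = os.urandom(chunk)
    written = 0
    with open(path, "wb") as f:
        while written < size_bytes:
            n = min(chunk, size_bytes - written)
            f.write(block[:n])
            written += n
        f.flush()
        os.fsync(f.fileno())  # make sure the data is on disk before deleting
    start = time.monotonic()
    os.remove(path)
    os.sync()  # ask the kernel to flush pending filesystem changes
    return time.monotonic() - start


# Small demo run (4 MiB) in a temporary directory.
demo_dir = tempfile.mkdtemp()
elapsed = time_large_delete(demo_dir, 4 * 1024 * 1024)
print(f"delete + sync took {elapsed:.3f}s")
```

Running this against a machine that hard-locks is obviously at your own risk; it is best paired with top or a similar monitor in a second terminal.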

Sean Clarke (sean-clarke) wrote :

Upgraded to 3.12.0-999-generic #201310210405 and still get lockups, still with absolutely nothing in the kernel logs.
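
When the kernel logs nothing before a hard lockup, one workaround is a small heartbeat logger that appends a timestamped line and forces it to disk, so the last line written bounds the time of the lockup. The sketch below is only an illustration: `write_heartbeat`, `monitor` and the log path are hypothetical names, and the log should ideally live on a filesystem other than the affected BTRFS volume.

```python
import os
import tempfile
import time


def write_heartbeat(path, now=None):
    """Append one timestamped load-average line to path and fsync it."""
    now = time.time() if now is None else now
    try:
        with open("/proc/loadavg") as f:
            load = f.read().strip()
    except OSError:
        load = "loadavg unavailable"
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    line = f"{stamp} {load}\n"
    with open(path, "a") as out:
        out.write(line)
        out.flush()
        os.fsync(out.fileno())  # force the line to disk so it survives a lockup
    return line


def monitor(path, interval, iterations):
    """Write a heartbeat every interval seconds, iterations times."""
    for _ in range(iterations):
        write_heartbeat(path)
        time.sleep(interval)


# Short demo: three heartbeats into a temporary file.
demo_path = os.path.join(tempfile.mkdtemp(), "heartbeat.log")
monitor(demo_path, interval=0.1, iterations=3)
```

In practice the loop would run for as long as the machine is up, with an interval of a few seconds.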

Sean Clarke (sean-clarke) wrote :

Situation is intolerable, crashed 2nd time in 12 hrs:

top - 12:06:37 up 1:51, 1 user, load average: 3.33, 0.87, 0.33
Tasks: 226 total, 5 running, 221 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 48.5 sy, 0.0 ni, 50.9 id, 0.1 wa, 0.0 hi, 0.5 si, 0.0 st
KiB Mem: 12295372 total, 12138296 used, 157076 free, 312 buffers
KiB Swap: 7831536 total, 56 used, 7831480 free, 24644 cached

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
 1005 root 20 0 0 0 0 R 100.1 0.0 0:50.16 btrfs-transacti
 2462 root 20 0 0 0 0 R 75.8 0.0 0:06.66 btrfs-flush_del
 2459 root 20 0 0 0 0 S 72.5 0.0 0:10.18 btrfs-flush_del
 2463 root 20 0 0 0 0 S 70.5 0.0 0:14.41 btrfs-flush_del
 2457 root 20 0 0 0 0 R 38.6 0.0 0:28.35 btrfs-flush_del
 2458 root 20 0 0 0 0 S 17.0 0.0 0:39.31 btrfs-flush_del
 1959 root 20 0 0 0 0 R 9.6 0.0 0:31.68 btrfs-flush_del
  100 root 20 0 0 0 0 S 7.6 0.0 0:00.31 kswap
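
To put a number on the listing above, the %CPU column of the btrfs kernel threads can be summed. The sketch below assumes top's default 12-column layout as shown in the header line; the helper name `btrfs_cpu_total` is made up for illustration.

```python
def btrfs_cpu_total(top_lines):
    """Sum the %CPU column for processes whose command starts with 'btrfs-'."""
    total = 0.0
    for line in top_lines:
        fields = line.split()
        # Default top layout:
        # PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
        if (
            len(fields) >= 12
            and fields[0].isdigit()
            and fields[11].startswith("btrfs-")
        ):
            total += float(fields[8])
    return round(total, 1)


# Process rows copied from the top output above.
sample = [
    "  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND",
    " 1005 root 20 0 0 0 0 R 100.1 0.0 0:50.16 btrfs-transacti",
    " 2462 root 20 0 0 0 0 R 75.8 0.0 0:06.66 btrfs-flush_del",
    " 2459 root 20 0 0 0 0 S 72.5 0.0 0:10.18 btrfs-flush_del",
    " 2463 root 20 0 0 0 0 S 70.5 0.0 0:14.41 btrfs-flush_del",
    " 2457 root 20 0 0 0 0 R 38.6 0.0 0:28.35 btrfs-flush_del",
    " 2458 root 20 0 0 0 0 S 17.0 0.0 0:39.31 btrfs-flush_del",
    " 1959 root 20 0 0 0 0 R 9.6 0.0 0:31.68 btrfs-flush_del",
    "  100 root 20 0 0 0 0 S 7.6 0.0 0:00.31 kswap",
]

print(btrfs_cpu_total(sample))  # 384.1
```

If the machine exposes 8 logical CPUs (an assumption; the report only says Core i7), 384.1 / 8 is roughly 48%, which lines up with the 48.5 sy figure in the summary line.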

Sean Clarke (sean-clarke) wrote :
penalvch (penalvch)
summary: - hard lockup on filserver
+ [BTRFS] hard lockup on filserver
Sean Clarke (sean-clarke) wrote :

Now running 3.13.0-031300rc3-generic (from Mainline/Trusty RC) and have had the longest continual uptime since moving to 3.11 (i.e. upgraded to Ubuntu 13.10) - 8 days.

Will keep an eye on the situation, but I have also had backups running and so far it looks stable.

Will continue to monitor.

Sean Clarke (sean-clarke) wrote :

Now over 2 weeks without a failure, and up until this kernel it never lasted anywhere near this.

Sean Clarke (sean-clarke) wrote :

Happy to close as fixed in 3.13

Sean Clarke (sean-clarke) wrote :

Fixed in 14.04 (3.13 kernel)

Changed in linux (Ubuntu):
status: Confirmed → Invalid
status: Invalid → Fix Released