Ubuntu
linux package

Umount of Multiple LVM Snapshots Causes 'soft lockup CPU#0 stuck for'

Bug #1115753 reported by Craig on 2013-02-05

This bug affects 3 people

Affects		Status	Importance	Assigned to	Milestone
	linux (Ubuntu)	Invalid	High	Unassigned

Bug Description

Simultaneous umount of multiple lvm snapshots causes system to hang / freeze / deadlock. Window manager freezes. If umount is executed in a non window-manager tty, messages containing "BUG", "blocked for more than xx seconds", and a list of tasks associated with each of the 8 cpu cores will scroll up at regular intervals. The tasks assigned to each cpu core are unchanging over time, appearing to be deadlocked. No relevant information is left in system logs. Hard reset / poweroff is required. Problem does not occur 100% of the time. It is less likely to occur if umount is done shortly after mount. Problem still occurs -- but less frequently -- if lazy umount -l is used. Snapshots are read-only, and are mounted read-only.

ProblemType: Bug
DistroRelease: Ubuntu 12.04
Package: linux-image-3.2.0-37-generic 3.2.0-37.58
ProcVersionSignature: Ubuntu 3.2.0-37.58-generic 3.2.35
Uname: Linux 3.2.0-37-generic x86_64
AlsaVersion: Advanced Linux Sound Architecture Driver Version 1.0.24.
ApportVersion: 2.0.1-0ubuntu17.1
Architecture: amd64
AudioDevicesInUse:
USER PID ACCESS COMMAND
/dev/snd/controlC1: craig 3257 F.... pulseaudio
/dev/snd/controlC0: craig 3257 F.... pulseaudio
CRDA: Error: command ['iw', 'reg', 'get'] failed with exit code 1: nl80211 not found.
Card0.Amixer.info:
Card hw:0 'SB'/'HDA ATI SB at 0xfeb00000 irq 16'
   Mixer name : 'Realtek ALC889'
   Components : 'HDA:10ec0889,1043846b,00100004'
   Controls : 49
   Simple ctrls : 24
Card1.Amixer.info:
Card hw:1 'HDMI'/'HDA ATI HDMI at 0xfea30000 irq 98'
   Mixer name : 'ATI R6xx HDMI'
   Components : 'HDA:1002aa01,00aa0100,00100100'
   Controls : 6
   Simple ctrls : 1
Card1.Amixer.values:
Simple mixer control 'IEC958',0
   Capabilities: pswitch pswitch-joined penum
   Playback channels: Mono
   Mono: Playback [on]
Date: Mon Feb 4 18:58:37 2013
HibernationDevice: RESUME=UUID=b788f786-ba76-4eb0-991f-d3bbc339d3f7
InstallationMedia: Ubuntu 11.10 "Oneiric Ocelot" - Release amd64 (20111011)
IwConfig:
lo no wireless extensions.

eth0 no wireless extensions.
MachineType: To be filled by O.E.M. To be filled by O.E.M.
MarkForUpload: True
ProcEnviron:
TERM=xterm
PATH=(custom, no user)
LANG=en_US.UTF-8
SHELL=/bin/bash
ProcFB: 0 VESA VGA
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-3.2.0-37-generic root=/dev/mapper/volgroup0-root ro
RelatedPackageVersions:
linux-restricted-modules-3.2.0-37-generic N/A
linux-backports-modules-3.2.0-37-generic N/A
linux-firmware 1.79.1
RfKill:

SourcePackage: linux
UpgradeStatus: Upgraded to precise on 2012-10-23 (104 days ago)
dmi.bios.date: 09/27/2011
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 0813
dmi.board.asset.tag: To be filled by O.E.M.
dmi.board.name: Crosshair V Formula
dmi.board.vendor: ASUSTeK Computer INC.
dmi.board.version: Rev 1.xx
dmi.chassis.asset.tag: To Be Filled By O.E.M.
dmi.chassis.type: 3
dmi.chassis.vendor: To Be Filled By O.E.M.
dmi.chassis.version: To Be Filled By O.E.M.
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr0813:bd09/27/2011:svnTobefilledbyO.E.M.:pnTobefilledbyO.E.M.:pvrTobefilledbyO.E.M.:rvnASUSTeKComputerINC.:rnCrosshairVFormula:rvrRev1.xx:cvnToBeFilledByO.E.M.:ct3:cvrToBeFilledByO.E.M.:
dmi.product.name: To be filled by O.E.M.
dmi.product.version: To be filled by O.E.M.
dmi.sys.vendor: To be filled by O.E.M.

Tags:

Revision history for this message

Craig (craig-st) wrote on 2013-02-05:

AcpiTables.txt (138.3 KiB, text/plain; charset="utf-8")
AlsaDevices.txt (656 bytes, text/plain; charset="utf-8")
AplayDevices.txt (371 bytes, text/plain; charset="utf-8")
ArecordDevices.txt (295 bytes, text/plain; charset="utf-8")
BootDmesg.txt (74.3 KiB, text/plain; charset="utf-8")
Card0.Amixer.values.txt (4.8 KiB, text/plain; charset="utf-8")
Card0.Codecs.codec.0.txt (14.7 KiB, text/plain; charset="utf-8")
Card1.Codecs.codec.0.txt (1.1 KiB, text/plain; charset="utf-8")
CurrentDmesg.txt (253.1 KiB, text/plain; charset="utf-8")
Dependencies.txt (2.0 KiB, text/plain; charset="utf-8")
Lspci.txt (21.5 KiB, text/plain; charset="utf-8")
Lsusb.txt (1.5 KiB, text/plain; charset="utf-8")
PciMultimedia.txt (1.2 KiB, text/plain; charset="utf-8")
ProcCpuinfo.txt (8.4 KiB, text/plain; charset="utf-8")
ProcInterrupts.txt (6.5 KiB, text/plain; charset="utf-8")
ProcModules.txt (3.4 KiB, text/plain; charset="utf-8")
PulseList.txt (24.5 KiB, text/plain; charset="utf-8")
UdevDb.txt (222.7 KiB, text/plain; charset="utf-8")
UdevLog.txt (399.8 KiB, text/plain; charset="utf-8")
WifiSyslog.txt (2.6 MiB, text/plain; charset="utf-8")

Revision history for this message

Brad Figg (brad-figg) wrote on 2013-02-05: Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status:	New → Confirmed

Revision history for this message

Joseph Salisbury (jsalisbury) wrote on 2013-02-05: Re: Umount of multiple lvm snapshots causes system to hang / freeze / deadlock

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v3.8 kernel[0] (Not a kernel in the daily directory) and install both the linux-image and linux-image-extra .deb packages.

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

If you are unable to test the mainline kernel, for example it will not boot, please add the tag: 'kernel-unable-to-test-upstream'.
Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.8-rc6-raring/

Changed in linux (Ubuntu):
importance:	Undecided → High
status:	Confirmed → Incomplete

Revision history for this message

Craig (craig-st) wrote on 2013-02-06:

I'm reluctant to test a non-release kernel because I don't really have a suitable test environment that is expendable, in the sense that if it were unintentionally clobbered, the result would be painless.

I have been working on a simple, generic script to reproduce the bug independently of the particulars of my backup environment -- replacing bacula and postgres with tar, for example -- but have been unsuccessful. This would allow someone else to test against another kernel. But since I've been unable to accomplish this, I'm kind of stuck.

tags:

added: kernel-unable-to-test-upstream

Revision history for this message

Craig (craig-st) wrote on 2013-02-26:

Screen Shot: Call Trace, Trace (746.2 KiB, image/jpeg)

Attached screenshot. A portion of the text is below; see the screenshot for the full text. I typed the text below by hand, so there may be typos, and it is not complete. Log files do not have any relevant information on reboot.

BUG: soft lockup CPU#0 stuck for 22s! [colord:1705]
BUG: soft lockup CPU#1 stuck for 22s! [umount:25493]
BUG: soft lockup CPU#2 stuck for 22s! [postgres:10171]
BUG: soft lockup CPU#3 stuck for 22s! [cupsd:1673]
BUG: soft lockup CPU#4 stuck for 22s! [Xorg:24230]
BUG: soft lockup CPU#5 stuck for 22s! [master:2693]
BUG: soft lockup CPU#6 stuck for 22s! [gconfd-2:24520]
BUG: soft lockup CPU#7 stuck for 22s! [flush-252:9:25167]
Stack:
Call Trace:
Code: xx xx xx xx...
<IRQ>
<EOI>
INFO: rcu_sched detected stall on CPU 1 (t=105061 jiffies)
INFO: rcu_sched detected stall on CPU 0 (t=105061 jiffies)
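
For anyone comparing captures like the one above, a small filter can reduce the "soft lockup" lines to CPU, duration, task name and PID. This is only a sketch: the function name `parse_soft_lockups` is made up for illustration, and the pattern assumes the two message forms that appear in this report (with and without the " - " before "CPU#").

```shell
#!/bin/sh
# Hypothetical helper: read a kernel log on stdin and print one
# "CPU SECONDS TASK PID" row per soft-lockup line.
# Handles both forms seen in this report:
#   BUG: soft lockup CPU#1 stuck for 22s! [umount:25493]
#   BUG: soft lockup - CPU#3 stuck for 22s! [udisks-lvm-pv-e:26148]
# The task name may itself contain colons (e.g. flush-252:9); the PID is
# taken as the digits after the last colon inside the brackets.
parse_soft_lockups() {
    sed -n 's/.*BUG: soft lockup[ -]*CPU#\([0-9][0-9]*\) stuck for \([0-9][0-9]*\)s! \[\([^]]*\):\([0-9][0-9]*\)\].*/\1 \2 \3 \4/p'
}
```

It could be fed from a saved log, for example `parse_soft_lockups < /var/log/syslog`. Lines that do not match are dropped silently.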

Revision history for this message

Craig (craig-st) wrote on 2013-02-26:

Problem persists after kernel upgrade to Linux 3.2.0-38-generic #61-Ubuntu SMP Tue Feb 19 12:18:21 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

Craig (craig-st) on 2013-02-26

summary:

- Umount of multiple lvm snapshots causes system to hang / freeze /
- deadlock
+ Umount of Multiple LVM Snapshots Causes 'soft lockup CPU#0 stuck for'

Revision history for this message

Craig (craig-st) wrote on 2013-02-26:

Replaced tag kernel-unable-to-test-upstream with tag kernel-bug-exists-upstream after confirming bug in kernel '3.2.0-38-generic #61', and changed status from Incomplete to Confirmed.

tags:	added: kernel-bug-exists-upstream removed: kernel-unable-to-test-upstream
Changed in linux (Ubuntu):
status:	Incomplete → Confirmed

Revision history for this message

Craig (craig-st) wrote on 2013-03-24:

Problem persists after kernel upgrade to Linux 3.2.0-39-generic #62-Ubuntu SMP Thu Feb 28 00:28:53 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

Revision history for this message

Craig (craig-st) wrote on 2013-07-22:

Problem may happen less frequently or go away when postgres is stopped and desktop user(s) are logged out before unmounting. Kernel is 3.2.0-49-generic #75-Ubuntu. Will do some additional testing when I have time.

Revision history for this message

markusj (markusj) wrote on 2013-08-03:

#10

Finally there is a bug report for my issue, yay!

This bug hits me on precise with any of the current LTS kernels. I have been struggling with it for more than a year, maybe even two. Sometimes everything works fine, sometimes the system locks up. I wasn't able to see the error message since I never had the console open when the issue appeared. There is nothing in the log files, either.

Environment:
Kernel 3.8.0-27-generic on Ubuntu 12.04.
LVM on top of cryptsetup/LUKS, seven logical volumes (six data, one swap)
Bacula file daemon running for backups (I use a shell script to set up lvm snapshots (ro) and mount them ro, like Craig.)

Last lines from bacula log:
03-Aug 17:39 markusnb-fd JobId 1706: ClientAfterJob: Unmounting Snapshot of vgluks/storage ... + udevadm settle --quiet
03-Aug 17:39 markusnb-fd JobId 1706: ClientAfterJob: + sync
03-Aug 17:39 markusnb-fd JobId 1706: ClientAfterJob: + local out
03-Aug 17:39 markusnb-fd JobId 1706: ClientAfterJob: ++ umount -f /var/run/lvm-snapshot/storage

Revision history for this message

penalvch (penalvch) wrote on 2013-08-05:

#11

Craig, as per http://www.asus.com/Motherboards/CROSSHAIR_V_FORMULA/#support_Download_8 an update is available for your BIOS (1703). If you update to this, does it change anything?

If not, could you please both specify what happened, and provide the output of the following terminal command:
sudo dmidecode -s bios-version && sudo dmidecode -s bios-release-date

Thank you for your understanding.

tags:	added: bios-outdated-1703 needs-upstream-testing removed: kernel-bug-exists-upstream
Changed in linux (Ubuntu):
status:	Confirmed → Incomplete
tags:	added: regression-potential

Revision history for this message

Craig (craig-st) wrote on 2013-08-24:

#12

Thanks, Chris, for your attention to this matter. Sorry for my delay. I did NOT update my BIOS as you suggested. Before doing so, I wanted to see if the bug was still present. I've been trying to reproduce the bug -- which was intermittent to begin with -- and have been unable to. There have been numerous updates to the kernel and other parts of my operating environment since I reported this bug. If I am able to reproduce the bug at some point in the future, I will proceed as you suggested.

Revision history for this message

Craig (craig-st) wrote on 2013-09-20:

#13

Reproduced the bug. Again, just to repeat so as to avoid any confusion: I have not updated the BIOS yet. I will do that next.

The lockup is a little different now. Whereas previously the system would hang before any of the snapshots were removed, now some of the snapshots are successfully umount'd and lvremove'd before the hang. This could be the result of changes made to the script that does the umount and lvremove, reordering the sequence of the snapshots, which consist of home, root, and var partitions. Another difference in the current lockup behavior is that there are some messages logged to syslog, whereas previously no messages made it into the log. Previously, the messages were only present as scrolling text on the terminal console. The messages that now appear in syslog are:

Sep 20 02:54:35 brain udevd[25253]: inotify_add_watch(6, /dev/dm-6, 10) failed: No such file or directory
Sep 20 02:54:35 brain udevd[25253]: inotify_add_watch(6, /dev/dm-6, 10) failed: No such file or directory
Sep 20 02:54:37 brain udevd[25253]: inotify_add_watch(6, /dev/dm-9, 10) failed: No such file or directory
Sep 20 02:54:37 brain udevd[25253]: inotify_add_watch(6, /dev/dm-9, 10) failed: No such file or directory
Sep 20 02:55:05 brain kernel: [72016.336022] BUG: soft lockup - CPU#3 stuck for 22s! [udisks-lvm-pv-e:26148]
Sep 20 02:55:05 brain kernel: [72016.336026] Modules linked in: dm_snapshot nls_iso8859_1 nls_cp437 vfat fat pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) rpcsec_gss_krb5 ip6table_filter ip6_tables ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle xt_tcpudp iptable_filter ip_tables x_tables bridge stp kvm_amd kvm nvidia(P) snd_hda_codec_hdmi bnep rfcomm bluetooth snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep usblp snd_pcm snd_seq_midi snd_rawmidi pl2303 usbserial snd_seq_midi_event parport_pc joydev ppdev snd_seq cdc_acm snd_timer snd_seq_device snd amd64_edac_mod edac_core it87 psmouse hwmon_vid fam15h_power sp5100_tco k10temp i2c_piix4 soundcore eeepc_wmi edac_mce_amd dm_multipath nfsd shpchp asus_wmi snd_page_alloc nfs sparse_keymap mxm_wmi serio_raw wmi lp mac_hid lockd fscache auth_rpcgss nfs_acl sunrpc parport binfmt_misc dm_crypt vesafb usbhid hid usb_storage aic79xx e1000e [last u
Sep 20 02:55:05 brain kernel: nloaded: ipmi_msghandler]
Sep 20 02:55:05 brain kernel: [72016.336088] CPU 3
Sep 20 02:55:05 brain kernel: [72016.336089] Modules linked in: dm_snapshot nls_iso8859_1 nls_cp437 vfat fat pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) rpcsec_gss_krb5 ip6table_filter ip6_tables ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle xt_tcpudp iptable_filter ip_tables x_tables bridge stp kvm_amd kvm nvidia(P) snd_hda_codec_hdmi bnep rfcomm bluetooth snd_hda_codec_realtek snd_hda_intel snd_hda_codec snSep 20 08:31:02 brain kernel: imklog 5.8.6, log source = /proc/kmsg started.

After the above messages in syslog, the next message in the log is from startup after reboot.

Revision history for this message

Craig (craig-st) wrote on 2013-09-20:

#14

$ uname -a
Linux brain 3.2.0-53-generic #81-Ubuntu SMP Thu Aug 22 21:01:03 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

Revision history for this message

markusj (markusj) wrote on 2013-09-20:

#15

Craig:
> The lockup is a little different now. Whereas previously the system would hang before any of the snapshots were removed, now some of the snapshots are successfully umount'd and lvremove'd before the hang. This could be the result of changes made to the script that does the umount and lvremove, reordering the sequence of the snapshots, which consist of home, root, and var partitions.

That is a red herring. I was alternating the umount and lvremove steps when this issue first appeared for me. I first assumed that lvremove caused the hangs, but after splitting the script into two phases (unmount all snapshots first, then lvremove all of them) the issue always appeared during the unmount step.

But maybe I have found a workaround: instead of running the script in the context of bacula-fd, I use "at" to decouple the unmount/lvremove step from bacula-fd and delay it by about two minutes. Before this change I had freezes in maybe two out of three backup jobs; in the three backup jobs since, I have had none.

I guess the issue lies somewhere in the way the kernel, bacula-fd, bash and umount interact. Breaking this chain by running the unmount/lvremove part outside bacula-fd, after the backup job itself has finished, keeps the issue from appearing.
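
The actual script is not shown in this comment, so the following is only a guess at the shape of the "at" workaround. The function name `schedule_cleanup`, the `DRY_RUN` switch and the cleanup script path are all hypothetical; the only real tool involved is at(1), which queues the file named by `-f` for the given time.

```shell
#!/bin/sh
# Hypothetical helper: queue the umount/lvremove script with at(1) instead
# of running it inline from the backup job. DRY_RUN defaults to 1, so the
# command is only printed unless the caller opts in with DRY_RUN=0.
schedule_cleanup() {
    cleanup_script=$1        # path to the umount/lvremove script (assumed)
    delay_minutes=${2:-2}    # about two minutes, as described in the comment
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "at -f $cleanup_script now + $delay_minutes minutes"
    else
        at -f "$cleanup_script" now + "$delay_minutes" minutes
    fi
}
```

A RunAfter script would then call something like `DRY_RUN=0 schedule_cleanup /usr/local/sbin/snap-cleanup.sh` and return immediately, leaving the atd daemon to run the cleanup outside the bacula-fd process tree.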

Revision history for this message

Craig (craig-st) wrote on 2013-09-20:

#16

Markus:
Looking back at an earlier version of my script that I was using when I first reported this bug, I used to do exactly what you describe: umount'ing all the partitions before doing any lvremove on any partition. Now I umount and lvremove each partition before proceeding to the next partition. It does seem to point the finger at umount rather than lvremove.

I do not run the script in the context of bacula-fd. That is to say, it is not a bacula RunBefore or RunAfter script. I'm assuming that's what you meant. Instead, I have a cron job Bourne shell script which does all of the commands that are external to bacula (lvcreate, mount, umount, lvremove, etc), and uses bconsole only to do the bacula backup job. So it looks like this:

...postgresql stuff...
...lvcreate...
...mount...

sudo -E -u bacula /usr/bin/bconsole -c /etc/bacula/bconsole.conf <<EOF
@output /dev/null
messages
@tee "${TMP_LOG}"
run job=brain-all-job client="${CLIENT}" level="${BACKUP_LEVEL}" pool="${BACKUP_POOL}" yes
wait jobname=brain-all-job
messages
@output
quit
EOF

...umount...
...lvremove...

So the bug is occurring for me in the context I've shown above.

Previously I had inserted some code in the script to try to fix and/or debug the problem, and have since commented it out. I was unsuccessful before, but will try again. This is what I had in the script. If anyone has any additional debugging suggestions, please advise:

# time -p udevadm settle
# logger --priority syslog.info --stderr --tag "`basename ${0}`" "umount'ing ${i}"-snap
# blockdev --report /dev/volgroup0/"${i}"-snap /dev/volgroup0/"${i}" /dev/mapper/volgroup0-"${i}"*
# time -p sync
# blockdev --flushbufs /dev/volgroup0/"${i}"-snap /dev/volgroup0/"${i}" /dev/mapper/volgroup0-"${i}"*
# Could also try disabling journal using tune2fs??
# cat /proc/locks > /etc/bacula/debugLocks.txt
# echo >> /etc/bacula/debugLocks.txt
# ps axl >> /etc/bacula/debugLocks.txt
# echo >> /etc/bacula/debugLocks.txt
# fuser -mv /mnt/"${i}"-snap >>/etc/bacula/debugLocks.txt
sync
umount /mnt/"${i}"-snap
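
One possible way to tidy the commented-out debugging above is to wrap it in a function that writes the state for a single mount point to a file and then syncs, so the data has a chance of surviving a hard reset. This is a sketch, not the script used here: the name `dump_umount_state` and its arguments are hypothetical, and it only uses the same sources already listed (/proc/locks, `ps axl`, `fuser -mv`).

```shell
#!/bin/sh
# Hypothetical helper: append lock and process state for one mount point to
# a file, then flush it to disk before the umount is attempted.
dump_umount_state() {
    mount_point=$1    # e.g. /mnt/home-snap
    out_file=$2       # should live on a filesystem outside the snapshot
    {
        echo "== $(date) $mount_point =="
        echo "-- /proc/locks --"
        [ -r /proc/locks ] && cat /proc/locks
        echo "-- ps axl --"
        ps axl
        echo "-- fuser -mv --"
        if command -v fuser >/dev/null 2>&1; then
            fuser -mv "$mount_point" 2>&1
        else
            echo "fuser not installed"
        fi
    } >>"$out_file" 2>&1
    # Flush the report so it is on disk if the following umount hangs.
    sync
}
```

In the loop it would be called just before the umount, for example `dump_umount_state /mnt/"${i}"-snap /etc/bacula/debugLocks.txt`.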

Revision history for this message

markusj (markusj) wrote on 2013-09-20: Re: [Bug 1115753] Re: Umount of Multiple LVM Snapshots Causes 'soft lockup CPU#0 stuck for'

#17

Craig:
> So the bug is occurring for me in the context I've shown above.

Strange. However, you guessed right, my script gets called by a RunAfter
directive.

Since the bug did not appear the last three times, the reason might be
different. Maybe it's the two-minute delay after the job has
finished? Oh, and the kernel was upgraded to 3.8.0-29-generic on
24.08.2013 and later to 3.8.0-30 on 07.09.2013, both x86_64 kernels.
The error-free backup jobs were run after the first kernel update,
so this kernel might have solved the issue as well.

The next update is scheduled for tomorrow, I will see what happens ;)

Revision history for this message

Craig (craig-st) wrote on 2013-09-20:

#18

I have just now updated the BIOS.

$ sudo dmidecode -s bios-version && sudo dmidecode -s bios-release-date
1703
10/17/2012

tags:

removed: bios-outdated-1703

Revision history for this message

penalvch (penalvch) wrote on 2013-09-23:

#19

Craig, still reproducible with the new BIOS?

As well, since it appears you updated from Oneiric, was this reproducible in it?

Also, could you please test the latest upstream kernel available following https://wiki.ubuntu.com/KernelMainlineBuilds ? It will allow additional upstream developers to examine the issue. Please do not test the daily kernel folder, but the one all the way at the bottom. Once you've tested the upstream kernel, please comment on which kernel version specifically you tested. If this bug is fixed in the mainline kernel, please add the following tags:
kernel-fixed-upstream
kernel-fixed-upstream-VERSION-NUMBER

where VERSION-NUMBER is the version number of the kernel you tested. For example:
kernel-fixed-upstream-v3.12-rc1

This can be done by clicking on the yellow circle with a black pencil icon next to the word Tags located at the bottom of the bug description. As well, please remove the tag:
needs-upstream-testing

If the mainline kernel does not fix this bug, please add the following tags:
kernel-bug-exists-upstream
kernel-bug-exists-upstream-VERSION-NUMBER

As well, please remove the tag:
needs-upstream-testing

Once testing of the upstream kernel is complete, please mark this bug's Status as Confirmed. Please let us know your results. Thank you for your understanding.

tags:

added: latest-bios-1703

Revision history for this message

Craig (craig-st) wrote on 2013-09-24:

#20

So far, backups have run twice without incident after BIOS update.

Revision history for this message

Craig (craig-st) wrote on 2013-09-27:

#21

Syslog messages and traces for BUG: soft lockup - CPU#3 stuck for 22s (19.1 KiB, text/plain)

The lockup happened again last night. kernel: [38460.336022] BUG: soft lockup - CPU#3 stuck for 22s! [umount:24238]. This time there was more information in the syslog than there has been in the past, including a couple of call traces and stack traces. The relevant portion of the syslog is attached. I'll try installing the upstream kernel next.

Revision history for this message

Craig (craig-st) wrote on 2013-09-27:

#22

Never installed an upstream kernel before. Just to confirm, I should install:

Index of /~kernel-ppa/mainline/v3.12-rc2-saucy:
linux-headers-3.12.0-031200rc2-generic_3.12.0-031200rc2.201309231935_amd64.deb
linux-image-3.12.0-031200rc2-generic_3.12.0-031200rc2.201309231935_amd64.deb
linux-headers-3.12.0-031200rc2_3.12.0-031200rc2.201309231935_all.deb

Which are located at the following url:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.12-rc2-saucy/

Revision history for this message

penalvch (penalvch) wrote on 2013-09-28:

#23

Craig:
>"Just to confirm, I should install Index of /~kernel-ppa/mainline/v3.12-rc2-saucy: linux-headers-3.12.0-031200rc2-generic_3.12.0-031200rc2.201309231935_amd64.deb linux-image-3.12.0-031200rc2-generic_3.12.0-031200rc2.201309231935_amd64.deb linux-headers-3.12.0-031200rc2_3.12.0-031200rc2.201309231935_all.deb
Which are located at the following url: http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.12-rc2-saucy/"

Yes.

Revision history for this message

Craig (craig-st) wrote on 2013-10-01:

#24

Error messages from kernel module generator (fglrx and nvidia) -- and other error messages -- during kernel install, and unable to boot, even in safe mode.

tags:

added: kernel-unable-to-test-upstream

Revision history for this message

penalvch (penalvch) wrote on 2013-10-01:

#25

Craig, could you please confirm this issue exists with the latest development release of Ubuntu? ISO images are available from http://cdimage.ubuntu.com/daily-live/current/ . If the issue remains, please make a comment to this.

Revision history for this message

Craig (craig-st) wrote on 2013-10-01:

#26

Unfortunately, I don't have the resources available to build and configure a clone of my existing system using one of the saucy iso images.

Revision history for this message

Craig (craig-st) wrote on 2013-10-25:

#27

I can't say this with certainty, but the lockup seems to happen more often when the backup destination is tape, and less often when the backup destination is disk. Actually, I always back up to disk first, then copy the disk backup to tape. But if the tape drive is offline, the job skips the last step (copy to tape), and it seems the umount is less likely to cause a problem, or maybe never causes a problem. Not sure. I haven't been watching this behavior rigorously. But I wonder if there could be something in the tape kernel driver that locks something in the filesystem, resulting in a deadlock when there is a umount request.

Revision history for this message

markusj (markusj) wrote on 2013-10-25:

#28

To add some more observations:
After running umount in the context of an "at" job, the issue almost completely disappeared. Before this change, I already suspected that power-saving mechanisms increase the probability of a freeze. Last week I had a freeze again, when the lid of my laptop was closed during the backup. Turning the display off via DPMS appears to have a similar effect.

Maybe the "at" assumption was wrong and the issue is more related to the graphics subsystem. It would, at least to a certain degree, explain why I was never able to reproduce it by running the script manually. On the other hand, I remember cases in which I was working and, at the end of the backup, the system stopped responding. This might be a red herring again ...

Revision history for this message

penalvch (penalvch) wrote on 2013-10-26:

#29

markusj, would you mind filing a new report so your hardware may be reviewed via a terminal:
ubuntu-bug linux

Revision history for this message

Launchpad Janitor (janitor) wrote on 2013-12-26:

#30

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status:	Incomplete → Expired

Revision history for this message

Tomas Pospisek (tpo-deb) wrote on 2015-11-03:

#31

I am seeing the same problem as reported here. My system and kernel are:

Ubuntu Trusty 14.04
linux 3.13.0-66-generic

Also, since implementing the "sleep 120" workaround mentioned in [1] (thanks a lot for that markusj!), my box is able to do backups cleanly.

I'm also seeing these kinds of erratic freezes on another box:

Debian wheezy
3.16.0-0.bpo.4-amd64 #1 SMP Debian 3.16.7-ckt11-1+deb8u4~bpo70+1

So I'll try the same workaround there. And will try to report back here.

Also I'll try to change the status of this bug (due to Ubuntu's automatic bug expiry).

Thanks again to markusj for the workaround. It would be very interesting to know, though, whether this problem has been fixed in newer kernels. Has anybody seen the bug go away just by upgrading the kernel (and if so, to which version)?

Thanks,
*t

[1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1115753/comments/15

Changed in linux (Ubuntu):
status:	Expired → Confirmed

Revision history for this message

Craig (craig-st) wrote on 2015-11-03:

#32

I switched from a Radeon graphics card to nVidia, and have not had a problem since.

Revision history for this message

penalvch (penalvch) wrote on 2015-11-05:

#33

Tomas Pospisek, it will help immensely if you filed a new report via a terminal:
ubuntu-bug linux

Please feel free to subscribe me to it.

Changed in linux (Ubuntu):
status:	Confirmed → Incomplete

Revision history for this message

penalvch (penalvch) wrote on 2015-11-05:

#34

Craig, this bug report is being closed due to your last comment https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1115753/comments/32 stating that you are no longer using the same hardware. For future reference you can manage the status of your own bugs by clicking on the current status in the yellow line and then choosing a new status in the revealed drop down box. You can learn more about bug statuses at https://wiki.ubuntu.com/Bugs/Status. Thank you again for taking the time to report this bug and helping to make Ubuntu better. Please submit any future bugs you may find.

Changed in linux (Ubuntu):
status:	Incomplete → Invalid

Revision history for this message

Tomas Pospisek (tpo-deb) wrote on 2015-11-10:

#35

So I'm reporting my further findings here in case somebody other than me stumbles upon this bug report in search of help.

I'll not create a new bug report as of now, since I'm still researching the problem and I think there's no point in opening a new bug report if it'll turn out to be invalid anyway.

So to the point: in my case the problem has been:

  Nov 8 05:02:51 foo kernel: [987432.148305] device-mapper: snapshots: Invalidating snapshot: Unable to allocate exception.

after that umount goes berserk:

  Nov 8 05:02:51 foo kernel: [987432.163602] Buffer I/O error on device dm-11, logical block 11985789
  Nov 8 05:02:51 foo kernel: [987432.163638] Buffer I/O error on device dm-11, logical block 11985790
  ...
  Nov 8 05:02:51 foo kernel: [987432.299393] EXT4-fs warning (device loop0): __ext4_read_dirblock:901: error reading directory block (ino 3539872, block 0)
  Nov 8 05:02:57 foo kernel: [987437.577541] quiet_error: 4396 callbacks suppressed
  Nov 8 05:02:57 foo kernel: [987437.577547] Buffer I/O error on device dm-11, logical block 12622895
  Nov 8 05:02:57 foo kernel: [987437.577609] lost page write due to I/O error on dm-11

and now umount is hanging there with 100% CPU usage with no possibility to `kill` it.

So one problem here is that the LVM snapshot is running out of exception space. Thus I should either allow the snapshot to be autoextended or increase the `-L` size I have given it.

But even so, umount shouldn't get stuck on the ext4 FS. Since I was `umount`ing without `-f` I will try that today and see what happens and report back here hopefully.
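
A check along the following lines could warn before the snapshot's exception (copy-on-write) area fills up and the snapshot is invalidated. It is a sketch under assumptions: the function names, the 80% default threshold and the volume name in the usage note are hypothetical, and the `data_percent` report field assumes a reasonably recent LVM2 (older releases report the same value as `snap_percent`).

```shell
#!/bin/sh
# Return 0 when a usage percentage (e.g. "85.30") is at or above a limit.
usage_at_or_above() {
    awk -v used="$1" -v limit="$2" 'BEGIN { exit !(used + 0 >= limit + 0) }'
}

# Print the fill percentage of a snapshot LV given as VG/LV.
# Assumption: this lvs(8) version knows the data_percent field.
snapshot_usage() {
    LC_ALL=C lvs --noheadings -o data_percent "$1" | tr -d ' '
}

# Hypothetical check: returns 0 if below the limit, 1 if at or above it,
# 2 if the usage could not be read.
check_snapshot() {
    used=$(snapshot_usage "$1")
    [ -n "$used" ] || return 2
    if usage_at_or_above "$used" "${2:-80}"; then
        echo "snapshot $1 is ${used}% full; consider lvextend -L +1G $1"
        return 1
    fi
    return 0
}
```

For example, `check_snapshot vgname/home-snap 80` could run periodically while the backup is in progress. The autoextend alternative is configured in lvm.conf through `snapshot_autoextend_threshold` and `snapshot_autoextend_percent`, and relies on LVM monitoring (dmeventd) being active.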

Revision history for this message

Tomas Pospisek (tpo-deb) wrote on 2015-11-17:

#36

`umount -f` seems to have fixed it. No problems until this point.

Revision history for this message

Tomas Pospisek (tpo-deb) wrote on 2015-12-01:

#37

I wrote:

> `umount -f` seems to have fixed it. No problems until this point.

Nope, that didn't work either. Now I have instead doubled the LVM snapshot 'exception' space.

Revision history for this message

penalvch (penalvch) wrote on 2015-12-03:

#38

Tomas Pospisek, as this report is closed, if you want your issue addressed you would want to file a new report as previously requested of you in https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1115753/comments/33 .
