drbd8 kernel module and padlock-sha kernel module in deadlock
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
drbd8 (Ubuntu) | New | High | Unassigned | |
Bug Description
Running 2.6.32-37-server on Ubuntu 10.04. The issue is a bit hard to explain, so bear with me.
- I've got a DRBD setup running, but my DRBD device got messed up. I'm not entirely sure how, but it happened.
- The DRBD resource is configured with "verify-alg sha1" and "csums-alg sha1".
- Now when drbdsetup runs to initialize the device, the command doesn't exit, and there's some crash information in syslog.
- I can also see a command "/sbin/modprobe -q -- sha1_all" being run that also never seems to exit; on the DRBD peer node (identical setup) the same command exits immediately.
- I also see errors about VIA Padlock devices not existing.
My assumption is that because there's some bad data on the block device underlying the DRBD resource, DRBD tries to check the device, tries to use SHA1 as configured, but somehow loading the SHA1 module deadlocks with whatever DRBD is trying to do. If I remove the kernel module "padlock-sha" and hard reboot, everything works as expected.
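One way to double-check the "VIA Padlock devices not existing" part is to look at the CPU flags. The sketch below is only an illustration: the helper names are made up for this example, and it assumes the PadLock hash engine is advertised in /proc/cpuinfo through the `phe` and `phe_en` flags. It sticks to basic Python so that it should also run on the older interpreter shipped with 10.04.

```python
def cpu_flags(cpuinfo_text):
    """Collect every flag listed on the 'flags' lines of /proc/cpuinfo text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            flags.update(value.split())
    return flags


def has_padlock_hash_engine(cpuinfo_text):
    """Assumption: 'phe' (present) and 'phe_en' (enabled) mark the hash engine."""
    flags = cpu_flags(cpuinfo_text)
    return "phe" in flags and "phe_en" in flags


# Hypothetical usage on the affected node:
#   with open("/proc/cpuinfo") as fh:
#       print(has_padlock_hash_engine(fh.read()))
```

If this reports False, the padlock-sha module has no hardware to drive on this machine, which would be consistent with the errors mentioned above.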
This is what's in syslog:
Jan 16 13:29:21 htz0 kernel: [ 178.247889] BUG: soft lockup - CPU#7 stuck for 61s! [kstop/7:1657]
Jan 16 13:29:21 htz0 kernel: [ 178.248206] Modules linked in: padlock_sha(-) sha1_generic drbd ipt_REJECT ipt_LOG xt_limit xt_tcpudp ipt_addrtype xt_state ip6table_filter ip6_tables nf_nat_irc nf_conntrack_irc nf_nat_ftp nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack_ftp nf_conntrack iptable_filter ip_tables x_tables fbcon tileblit font lp bitblit softcursor vga16fb video parport xhci vgastate output multipath linear aacraid 3w_9xxx 3w_xxxx raid10 raid456 async_pq async_xor xor async_memcpy async_raid6_recov raid6_pq async_tx raid1 raid0 sata_nv r8169 ahci mii sata_sil sata_via
Jan 16 13:29:21 htz0 kernel: [ 178.248245] CPU 7:
Jan 16 13:29:21 htz0 kernel: [ 178.248247] Modules linked in: padlock_sha(-) sha1_generic drbd ipt_REJECT ipt_LOG xt_limit xt_tcpudp ipt_addrtype xt_state ip6table_filter ip6_tables nf_nat_irc nf_conntrack_irc nf_nat_ftp nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack_ftp nf_conntrack iptable_filter ip_tables x_tables fbcon tileblit font lp bitblit softcursor vga16fb video parport xhci vgastate output multipath linear aacraid 3w_9xxx 3w_xxxx raid10 raid456 async_pq async_xor xor async_memcpy async_raid6_recov raid6_pq async_tx raid1 raid0 sata_nv r8169 ahci mii sata_sil sata_via
Jan 16 13:29:21 htz0 kernel: [ 178.248282] Pid: 1657, comm: kstop/7 Not tainted 2.6.32-37-server #81-Ubuntu System Product Name
Jan 16 13:29:21 htz0 kernel: [ 178.248284] RIP: 0010:[<
Jan 16 13:29:21 htz0 kernel: [ 178.248291] RSP: 0018:ffff880417
Jan 16 13:29:21 htz0 kernel: [ 178.248293] RAX: 0000000000000001 RBX: ffff8804170e3e00 RCX: 0000000000000000
Jan 16 13:29:21 htz0 kernel: [ 178.248296] RDX: ffffffff81869248 RSI: 0000000000000100 RDI: ffffffff81869240
Jan 16 13:29:21 htz0 kernel: [ 178.248298] RBP: ffffffff81013c6e R08: ffff8804170e2000 R09: 0000000000000000
Jan 16 13:29:21 htz0 kernel: [ 178.248300] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
Jan 16 13:29:21 htz0 kernel: [ 178.248302] R13: 0000000000000001 R14: 0000000000000001 R15: 0000000000000400
Jan 16 13:29:21 htz0 kernel: [ 178.248305] FS: 000000000000000
Jan 16 13:29:21 htz0 kernel: [ 178.248308] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Jan 16 13:29:21 htz0 kernel: [ 178.248310] CR2: 00007fa7f1c65beb CR3: 0000000001001000 CR4: 00000000000406e0
Jan 16 13:29:21 htz0 kernel: [ 178.248312] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jan 16 13:29:21 htz0 kernel: [ 178.248314] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jan 16 13:29:21 htz0 kernel: [ 178.248316] Call Trace:
Jan 16 13:29:21 htz0 kernel: [ 178.248323] [<ffffffff81081
Jan 16 13:29:21 htz0 kernel: [ 178.248328] [<ffffffff81081
Jan 16 13:29:21 htz0 kernel: [ 178.248332] [<ffffffff81086
Jan 16 13:29:21 htz0 kernel: [ 178.248337] [<ffffffff81081
Jan 16 13:29:21 htz0 kernel: [ 178.248340] [<ffffffff81085
Jan 16 13:29:21 htz0 kernel: [ 178.248344] [<ffffffff81014
Jan 16 13:29:21 htz0 kernel: [ 178.248348] [<ffffffff81085
Jan 16 13:29:21 htz0 kernel: [ 178.248351] [<ffffffff81014
Hope that helps; feel free to ask for more information.
Jens
Changed in drbd8 (Ubuntu):
importance: Undecided → High
Hi Jens-
Thanks for reporting this. This sounds like a tough one to debug. Any chance you've hit this again since reporting? I assume this is on a production system and there's no chance of you trying to reproduce it on demand. If you do hit this again, before rebooting, can you capture the output of 'dmesg', 'cat /proc/drbd', 'ps aux | grep "D "' and 'netstat -ntp | grep 7788' (or whatever port the resource is configured to use) from both nodes? Syslogs from both nodes would also be helpful, as would your drbd.conf (with any sensitive information edited out).
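To make that capture quicker when the hang recurs, the four commands could be wrapped in a small script along these lines. This is a rough sketch, not a tested tool: the function names, file labels, and output directory are invented for the example, and the default port 7788 is only the one mentioned above. It applies no timeout, so a command that itself blocks on the hung system would stall the script.

```python
import os
import subprocess


def diagnostic_commands(port=7788):
    """Return (label, shell command) pairs for the outputs requested above."""
    return [
        ("dmesg", "dmesg"),
        ("proc_drbd", "cat /proc/drbd"),
        ("ps_d_state", 'ps aux | grep "D "'),
        ("netstat_drbd", "netstat -ntp | grep %d" % port),
    ]


def collect(outdir, commands):
    """Run each command through the shell and save its output to outdir/<label>.txt."""
    if not os.path.isdir(outdir):
        os.makedirs(outdir)
    written = []
    for label, cmd in commands:
        proc = subprocess.Popen(cmd, shell=True,
                                stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT)
        output = proc.communicate()[0]  # bytes: stdout and stderr combined
        path = os.path.join(outdir, label + ".txt")
        with open(path, "wb") as fh:
            fh.write(output)
        written.append(path)
    return written


# Hypothetical usage, run as root on each node before rebooting:
#   collect("drbd-diag", diagnostic_commands(port=7788))
```

Running it on both nodes and attaching the resulting directories, together with the syslogs and drbd.conf, would cover everything asked for here.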
When you say your device got messed up, do you mean there was divergent data on both nodes or that they disconnected and were required to carry out a normal resync?