drbd8 kernel module and padlock-sha kernel module in deadlock

Bug #917134 reported by Jens Finkhäuser
This bug affects 1 person
Affects          Status  Importance  Assigned to  Milestone
drbd8 (Ubuntu)   New     High        Unassigned

Bug Description

Running 2.6.32-37-server on Ubuntu 10.04. The issue is a bit hard to explain, so bear with me.

- I've got a DRBD setup running, but my DRBD device got messed up. I'm not entirely sure how, but it happened.
- The DRBD resource is configured with "verify-alg sha1" and "csums-alg sha1".
- Now when drbdsetup runs to initialize the device, the command doesn't exit, and there's some crash information in syslog.
- I can also see a command "/sbin/modprobe -q -- sha1_all" being run that never seems to exit; on the DRBD peer node (identical setup) the same command exits immediately.
- I also see errors about VIA Padlock devices not existing.
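
To compare the two nodes, it can help to confirm which digest algorithms each node's DRBD configuration actually requests. The following is a minimal sketch, not part of the original report: `show_drbd_digest_algs` is a hypothetical helper name, and the default path `/etc/drbd.conf` is an assumption (resources may live in other files on a given system).

```shell
#!/bin/sh
# Sketch: list the digest algorithms a DRBD config file asks for.
# show_drbd_digest_algs is a hypothetical helper; the default path is an assumption.
show_drbd_digest_algs() {
    conf=${1:-/etc/drbd.conf}
    # Match option lines such as "verify-alg sha1;" and "csums-alg sha1;".
    grep -E '^[[:space:]]*(verify-alg|csums-alg)[[:space:]]' "$conf"
}

# Example: show_drbd_digest_algs /etc/drbd.conf
```

Running it on both nodes and comparing the output would show whether the SHA1 settings really are identical on each side.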

My assumption is that, because there's some bad data on the block device underlying the DRBD resource, DRBD tries to check the device using SHA1 as configured, and loading the SHA1 module somehow deadlocks with whatever DRBD is doing at that point. If I remove the kernel module "padlock-sha" and hard reboot, everything works as expected.
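
As an alternative to moving the module file away, a modprobe blacklist entry might keep padlock-sha from being pulled in through alias lookups. This is only a sketch under assumptions: the helper name `blacklist_padlock_sha` and the file name `blacklist-padlock-sha.conf` are made up for illustration, and whether this avoids the lockup described here is untested.

```shell
#!/bin/sh
# Sketch: write a modprobe blacklist entry for padlock-sha.
# Helper name and file name are illustrative; run as root for /etc/modprobe.d.
blacklist_padlock_sha() {
    conf_dir=${1:-/etc/modprobe.d}
    conf_file="$conf_dir/blacklist-padlock-sha.conf"
    # "blacklist" makes modprobe ignore the module's internal aliases,
    # so alias-based requests no longer resolve to padlock-sha.
    printf '%s\n' \
        '# Do not load padlock-sha via alias lookups.' \
        'blacklist padlock-sha' > "$conf_file" || return 1
    echo "$conf_file"
}

# Example: blacklist_padlock_sha /etc/modprobe.d
```

A blacklist entry does not stop an explicit `modprobe padlock-sha`, so it is a softer measure than removing the module file.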

This is what's in syslog:

Jan 16 13:29:21 htz0 kernel: [ 178.247889] BUG: soft lockup - CPU#7 stuck for 61s! [kstop/7:1657]
Jan 16 13:29:21 htz0 kernel: [ 178.248206] Modules linked in: padlock_sha(-) sha1_generic drbd ipt_REJECT ipt_LOG xt_limit xt_tcpudp ipt_addrtype xt_state ip6table_filter ip6_tables nf_nat_irc nf_conntrack_irc nf_nat_ftp nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack_ftp nf_conntrack iptable_filter ip_tables x_tables fbcon tileblit font lp bitblit softcursor vga16fb video parport xhci vgastate output multipath linear aacraid 3w_9xxx 3w_xxxx raid10 raid456 async_pq async_xor xor async_memcpy async_raid6_recov raid6_pq async_tx raid1 raid0 sata_nv r8169 ahci mii sata_sil sata_via
Jan 16 13:29:21 htz0 kernel: [ 178.248245] CPU 7:
Jan 16 13:29:21 htz0 kernel: [ 178.248247] Modules linked in: padlock_sha(-) sha1_generic drbd ipt_REJECT ipt_LOG xt_limit xt_tcpudp ipt_addrtype xt_state ip6table_filter ip6_tables nf_nat_irc nf_conntrack_irc nf_nat_ftp nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack_ftp nf_conntrack iptable_filter ip_tables x_tables fbcon tileblit font lp bitblit softcursor vga16fb video parport xhci vgastate output multipath linear aacraid 3w_9xxx 3w_xxxx raid10 raid456 async_pq async_xor xor async_memcpy async_raid6_recov raid6_pq async_tx raid1 raid0 sata_nv r8169 ahci mii sata_sil sata_via
Jan 16 13:29:21 htz0 kernel: [ 178.248282] Pid: 1657, comm: kstop/7 Not tainted 2.6.32-37-server #81-Ubuntu System Product Name
Jan 16 13:29:21 htz0 kernel: [ 178.248284] RIP: 0010:[<ffffffff810b7ca5>] [<ffffffff810b7ca5>] stop_cpu+0x85/0xf0
Jan 16 13:29:21 htz0 kernel: [ 178.248291] RSP: 0018:ffff8804170e3df0 EFLAGS: 00000293
Jan 16 13:29:21 htz0 kernel: [ 178.248293] RAX: 0000000000000001 RBX: ffff8804170e3e00 RCX: 0000000000000000
Jan 16 13:29:21 htz0 kernel: [ 178.248296] RDX: ffffffff81869248 RSI: 0000000000000100 RDI: ffffffff81869240
Jan 16 13:29:21 htz0 kernel: [ 178.248298] RBP: ffffffff81013c6e R08: ffff8804170e2000 R09: 0000000000000000
Jan 16 13:29:21 htz0 kernel: [ 178.248300] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
Jan 16 13:29:21 htz0 kernel: [ 178.248302] R13: 0000000000000001 R14: 0000000000000001 R15: 0000000000000400
Jan 16 13:29:21 htz0 kernel: [ 178.248305] FS: 0000000000000000(0000) GS:ffff88000ffc0000(0000) knlGS:0000000000000000
Jan 16 13:29:21 htz0 kernel: [ 178.248308] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Jan 16 13:29:21 htz0 kernel: [ 178.248310] CR2: 00007fa7f1c65beb CR3: 0000000001001000 CR4: 00000000000406e0
Jan 16 13:29:21 htz0 kernel: [ 178.248312] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jan 16 13:29:21 htz0 kernel: [ 178.248314] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jan 16 13:29:21 htz0 kernel: [ 178.248316] Call Trace:
Jan 16 13:29:21 htz0 kernel: [ 178.248323] [<ffffffff81081597>] ? run_workqueue+0xc7/0x1a0
Jan 16 13:29:21 htz0 kernel: [ 178.248328] [<ffffffff81081713>] ? worker_thread+0xa3/0x110
Jan 16 13:29:21 htz0 kernel: [ 178.248332] [<ffffffff81086140>] ? autoremove_wake_function+0x0/0x40
Jan 16 13:29:21 htz0 kernel: [ 178.248337] [<ffffffff81081670>] ? worker_thread+0x0/0x110
Jan 16 13:29:21 htz0 kernel: [ 178.248340] [<ffffffff81085dc6>] ? kthread+0x96/0xa0
Jan 16 13:29:21 htz0 kernel: [ 178.248344] [<ffffffff810141aa>] ? child_rip+0xa/0x20
Jan 16 13:29:21 htz0 kernel: [ 178.248348] [<ffffffff81085d30>] ? kthread+0x0/0xa0
Jan 16 13:29:21 htz0 kernel: [ 178.248351] [<ffffffff810141a0>] ? child_rip+0x0/0x20

Hope that helps; feel free to ask for more information.

Jens

Dave Walker (davewalker)
Changed in drbd8 (Ubuntu):
importance: Undecided → High
Adam Gandelman (gandelman-a) wrote :

Hi Jens-

Thanks for reporting this. This sounds like a tough one to debug. Any chance you've hit this again since reporting? I assume this is on a production system and there's no chance of you trying to reproduce it on demand. If you do hit this again, before rebooting, can you capture the output of 'dmesg', 'cat /proc/drbd', 'ps aux | grep "D "' and 'netstat -ntp | grep 7788' (or whatever port the resource is configured to use) from both nodes? Syslogs from both nodes would also be helpful, as well as your drbd.conf (with any sensitive information edited out).
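
The commands requested above could be gathered in one pass on each node with something like the sketch below. The function name `collect_drbd_diag`, the output file names, and the default port 7788 are illustrative assumptions; each command is allowed to fail so that a missing tool or an unloaded drbd module does not abort the collection.

```shell
#!/bin/sh
# Sketch: gather the requested diagnostics into one directory per node.
# Function name, file names, and the default port are illustrative.
collect_drbd_diag() {
    outdir=$1
    port=${2:-7788}
    mkdir -p "$outdir" || return 1

    # Kernel ring buffer (may need root on some systems).
    dmesg > "$outdir/dmesg.txt" 2>&1 || true
    # DRBD connection and disk state; absent if the drbd module is not loaded.
    cat /proc/drbd > "$outdir/proc-drbd.txt" 2>&1 || true
    # Rough filter for processes in uninterruptible sleep (state D).
    ps aux 2>/dev/null | grep "D " > "$outdir/ps-dstate.txt" || true
    # TCP connections on the resource's replication port.
    netstat -ntp 2>/dev/null | grep "$port" > "$outdir/netstat.txt" || true
}

# Example: collect_drbd_diag "/tmp/drbd-diag-$(hostname)" 7788
```

Running it before the reboot on both nodes leaves one directory per host that can be attached to the report alongside the syslogs and drbd.conf.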

When you say your device got messed up, do you mean there was divergent data on both nodes or that they disconnected and were required to carry out a normal resync?

Jens Finkhäuser (finkhaeuser-consulting) wrote :

Man, I wanted to get back to this for so long...

Yes, I can reproduce it on demand by simply rebooting the machine. Right now, there's not much of a problem with that, so if you need me to reboot this machine a few times, fine.

I'll attach the requested information today (hopefully).

Jens Finkhäuser (finkhaeuser-consulting) wrote :

So, my plan was to resync the device, get everything into a clean state, and give you the requested info, then move the padlock-sha module back, reboot, and see things fail.

Turns out that things fail sooner now. The padlock-sha module is moved to a safe place, so shouldn't be messing with things. Then I resync the drbd device, and that seems to work well enough.

After resyncing the machines, I try to switch the htz0 machine to primary mode (dual-primary mode, the goal is to run ocfs2 on top, htz1 is already in primary mode), and get this:

root@htz0 ~ # drbdadm primary r0
Command 'drbdsetup 0 primary' did not terminate within 121 seconds

I'll attach the requested info for this state.
