4.4.0-47-generic dm_snapshot random deadlock

Bug #1645187 reported by john
This bug affects 1 person
Affects          Status    Importance  Assigned to  Milestone
linux (Ubuntu)   Triaged   High        Unassigned
Xenial           Triaged   High        Unassigned

Bug Description

The server's load becomes high with tasks stuck in D state, leaving no choice but to reboot the system.
Could this be related to the following?

https://bugzilla.kernel.org/show_bug.cgi?id=119841
https://www.redhat.com/archives/dm-devel/2016-June/msg00399.html
https://patchwork.kernel.org/patch/9223697/
https://xen.crc.id.au/bugs/view.php?id=75

ii xen-hypervisor-4.4-amd64 4.4.2-0ubuntu0.14.04.7 amd64 Xen Hypervisor on AMD64
ii linux-image-extra-4.4.0-47-generic 4.4.0-47.68~14.04.1 amd64 Linux kernel extra modules for version 4.4.0 on 64 bit x86 SMP
ii linux-image-4.4.0-47-generic 4.4.0-47.68~14.04.1 amd64 Linux kernel image for version 4.4.0 on 64 bit x86 SMP

kernel messages:
----------------
[890070.994700] INFO: task blkback.3.xvda2:5756 blocked for more than 120 seconds.
[890070.994758] Not tainted 4.4.0-47-generic #68~14.04.1-Ubuntu
[890070.994806] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[890070.994884] blkback.3.xvda2 D ffff8800b7ff3928 0 5756 2 0x00000000
[890070.994890] ffff8800b7ff3928 ffffffff81e13500 ffff8800b6a4be80 ffff8800b7ff4000
[890070.994895] ffff88013a83cc18 ffff88013a83cc00 ffffffff00000000 fffffffe00000001
[890070.994898] ffff8800b7ff3940 ffffffff817fafc5 ffff8800b6a4be80 ffff8800b7ff39c0
[890070.994902] Call Trace:
[890070.994912] [<ffffffff817fafc5>] schedule+0x35/0x80
[890070.994917] [<ffffffff817fd46a>] rwsem_down_write_failed+0x1da/0x320
[890070.994923] [<ffffffff81689577>] ? push+0x47/0x50
[890070.994927] [<ffffffff81689f07>] ? dm_kcopyd_copy+0x147/0x1f0
[890070.994931] [<ffffffff813e6a53>] call_rwsem_down_write_failed+0x13/0x20
[890070.994933] [<ffffffff817fcd7d>] ? down_write+0x2d/0x40
[890070.994939] [<ffffffffc0314dae>] __origin_write+0x6e/0x210 [dm_snapshot]
[890070.994944] [<ffffffff81185625>] ? mempool_alloc_slab+0x15/0x20
[890070.994946] [<ffffffff8118574f>] ? mempool_alloc+0x5f/0x150
[890070.994949] [<ffffffffc0314fb7>] do_origin.isra.14+0x67/0x90 [dm_snapshot]
[890070.994952] [<ffffffffc0315042>] origin_map+0x62/0x80 [dm_snapshot]
[890070.994955] [<ffffffff8167f2da>] __map_bio+0x3a/0x110
[890070.994957] [<ffffffff816809c0>] __split_and_process_bio+0x240/0x3c0
[890070.994960] [<ffffffff81680baa>] dm_make_request+0x6a/0xd0
[890070.994964] [<ffffffff813aa221>] generic_make_request+0xe1/0x1a0
[890070.994968] [<ffffffff813aa357>] submit_bio+0x77/0x150
[890070.994971] [<ffffffff813a1c71>] ? bio_alloc_bioset+0x181/0x2a0
[890070.994977] [<ffffffffc02c6a1d>] dispatch_rw_block_io+0x4fd/0x9b0 [xen_blkback]
[890070.994981] [<ffffffff8101c244>] ? xen_load_sp0+0x84/0x180
[890070.994985] [<ffffffffc02c70c5>] __do_block_io_op+0x1f5/0x650 [xen_blkback]
[890070.994990] [<ffffffff810e5e18>] ? del_timer_sync+0x48/0x50
[890070.994993] [<ffffffff817fd8ab>] ? schedule_timeout+0x16b/0x2d0
[890070.994997] [<ffffffffc02c7880>] xen_blkif_schedule+0xd0/0x820 [xen_blkback]
[890070.995002] [<ffffffff810a4e1a>] ? finish_task_switch+0x7a/0x290
[890070.995004] [<ffffffff817fa969>] ? __schedule+0x359/0x980
[890070.995010] [<ffffffff810bde70>] ? prepare_to_wait_event+0xf0/0xf0
[890070.995014] [<ffffffffc02c77b0>] ? xen_blkif_be_int+0x30/0x30 [xen_blkback]
[890070.995018] [<ffffffff8109ba29>] kthread+0xc9/0xe0
[890070.995021] [<ffffffff8109b960>] ? kthread_park+0x60/0x60
[890070.995025] [<ffffffff817febcf>] ret_from_fork+0x3f/0x70
[890070.995027] [<ffffffff8109b960>] ? kthread_park+0x60/0x60
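
For anyone comparing several of these reports, the trace above can be summarized mechanically. The following is a minimal sketch, not part of the original report: the function name `summarize_hung_tasks` and the regular expressions are illustrative assumptions that match only the log layout shown above (4.4-era hung-task output). It keeps the reliable frames and skips the ones the kernel marks with "?".

```python
import re

# First line of a hung-task report, e.g.
# "INFO: task blkback.3.xvda2:5756 blocked for more than 120 seconds."
HUNG_RE = re.compile(
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)

# One stack frame, e.g.
# "[<ffffffffc0314dae>] __origin_write+0x6e/0x210 [dm_snapshot]"
# A leading "? " marks a frame the unwinder is not sure about.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\] (?P<guess>\? )?(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?: \[(?P<module>\w+)\])?"
)


def summarize_hung_tasks(log_text):
    """Return one dict per hung-task report found in log_text.

    Each dict holds the task name, pid, timeout in seconds, and a list of
    (symbol, module) tuples for the reliable frames. module is None for
    frames that belong to the core kernel.
    """
    reports = []
    current = None
    for line in log_text.splitlines():
        hung = HUNG_RE.search(line)
        if hung:
            current = {
                "comm": hung.group("comm"),
                "pid": int(hung.group("pid")),
                "secs": int(hung.group("secs")),
                "frames": [],
            }
            reports.append(current)
            continue
        frame = FRAME_RE.search(line)
        if frame and current is not None and not frame.group("guess"):
            current["frames"].append((frame.group("sym"), frame.group("module")))
    return reports
```

Fed the kernel messages above, this should report task blkback.3.xvda2 (pid 5756) with frames from xen_blkback at the bottom and dm_snapshot near the top, blocked in rwsem_down_write_failed.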

john (jcbsth) wrote :
description: updated
john (jcbsth) wrote :

this regression can be reproduced:

- with 15 VPS
- under heavy I/O
- with repeated snapshot / fsck / data read

the system locks up in about 3 hours
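
The exact commands used for the reproduction are not given in this report. As a rough sketch only, one snapshot / fsck / read cycle could look like the following, assuming the guests' disks are LVM logical volumes (the report does not say so). The volume group, volume, and snapshot names and the snapshot size are hypothetical placeholders, and the function names are illustrative.

```python
import subprocess


def snapshot_cycle_commands(vg, origin, snap="stress-snap", cow_size="1G"):
    """Build the commands for one snapshot / fsck / read / remove cycle.

    vg, origin, snap and cow_size are placeholders, not values from the report.
    """
    snap_dev = f"/dev/{vg}/{snap}"
    return [
        # Create a snapshot of the origin volume.
        ["lvcreate", "--snapshot", "--size", cow_size, "--name", snap, f"{vg}/{origin}"],
        # Check the snapshot's filesystem without modifying it.
        ["fsck", "-n", snap_dev],
        # Read the whole snapshot and discard the data.
        ["dd", f"if={snap_dev}", "of=/dev/null", "bs=1M"],
        # Drop the snapshot again.
        ["lvremove", "--force", f"{vg}/{snap}"],
    ]


def run_cycles(vg, origin, cycles, dry_run=True):
    """Run (or, by default, only print) the cycle the given number of times."""
    for _ in range(cycles):
        for argv in snapshot_cycle_commands(vg, origin):
            if dry_run:
                print(" ".join(argv))
            else:
                # fsck may exit non-zero on a snapshot of a mounted
                # filesystem, so failures do not abort the loop.
                subprocess.run(argv, check=False)
```

With the default dry_run=True nothing is executed, so the commands can be reviewed first. Running this for real needs root and, per the report, heavy guest I/O in parallel before the lockup appears.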

full stack trace attached, please consider this a serious regression

most likely this upstream bug : https://bugzilla.kernel.org/show_bug.cgi?id=119841

Stefan Bader (smb) wrote :

Had a quick look at this. The bugzilla discussion references two patches: one that is considered too intrusive (but seems to help with the reported problem), and another one (which has a patchwork reference) that did not seem to help. Neither has been committed upstream yet.

affects: linux-lts-xenial (Ubuntu) → linux (Ubuntu)
Changed in linux (Ubuntu):
importance: Undecided → Medium
status: New → Triaged
importance: Medium → High
Changed in linux (Ubuntu Xenial):
status: New → Triaged
importance: Undecided → High
tags: added: kernel-da-key xenial
john (jcbsth) wrote :

after applying the 3 patches from:
https://www.redhat.com/archives/dm-devel/2016-June/msg00399.html
https://www.redhat.com/archives/dm-devel/2016-June/msg00400.html
https://www.redhat.com/archives/dm-devel/2016-June/msg00401.html

building the ubuntu kernel fails :
apt-get source -b linux-image-4.4.0-47-generic

checking whether DECLARE_EVENT_CLASS() is available... no
checking whether current->bio_tail exists... no
checking whether current->bio_list exists... configure: error: no - Please file a bug report at
       https://github.com/zfsonlinux/zfs/issues/new
debian/rules.d/2-binary-arch.mk:70: recipe for target '/root/linux-4.4.0/debian/stamps/stamp-build-generic' failed
make: *** [/root/linux-4.4.0/debian/stamps/stamp-build-generic] Error 1

john (jcbsth) wrote :
Stefan Bader (smb) wrote :

That seems possible. Could you confirm whether the kernel from http://people.canonical.com/~smb/lp1645187/ (which has the one patch from comment #5 applied) fixes the deadlock?

General note: Given that this patch came from 4.11 and refers to a change made in 3.10, it would be required in 14.04/Trusty onwards.

john (jcbsth) wrote :

patched upstream in 4.4.55, please upgrade / backport :

https://cdn.kernel.org/pub/linux/kernel/v4.x/ChangeLog-4.4.55

commit cd8ad4d9eb6d9ee04e77b42c6a7a15eabada85ac
Author: Mikulas Patocka <email address hidden>
Date: Wed Feb 15 11:26:10 2017 -0500

    dm: flush queued bios when process blocks to avoid deadlock

    commit d67a5f4b5947aba4bfe9a80a2b86079c215ca755 upstream.

Tim Gardner (timg-tpi) wrote :

Stable update v4.4.55 should show up in Ubuntu-4.4.0-71.92

john (jcbsth) wrote :

changelog only shows :

linux (4.4.0-71.92) xenial; urgency=low

  * CVE-2017-7184
    - xfrm_user: validate XFRM_MSG_NEWAE XFRMA_REPLAY_ESN_VAL replay_window
    - xfrm_user: validate XFRM_MSG_NEWAE incoming ESN size harder

 -- Thadeu Lima de Souza Cascardo <email address hidden> Fri, 24 Mar 2017 09:32:49 -0300
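
The check done by eye above can be scripted. This is a minimal sketch under stated assumptions: the function name `changelog_mentions_fix` is illustrative, and it only looks for the upstream commit subject quoted earlier in this report inside whatever changelog text it is given. A miss in a single changelog entry does not prove the fix is absent from the package, since it could be listed under a different entry.

```python
# Subject line of the upstream fix, as quoted from ChangeLog-4.4.55.
FIX_SUBJECT = "dm: flush queued bios when process blocks to avoid deadlock"


def changelog_mentions_fix(changelog_text, subject=FIX_SUBJECT):
    """Return True if the changelog text contains the fix's subject line.

    Whitespace is collapsed first so a subject wrapped across two
    changelog lines still matches; the comparison ignores case.
    """
    normalized = " ".join(changelog_text.split()).lower()
    return subject.lower() in normalized
```

For the 4.4.0-71.92 entry pasted above, which lists only the CVE-2017-7184 changes, this returns False.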
