Ubuntu16.10-KVM:Big configuration with multiple guests running SRIOV VFs caused KVM host hung and all KVM guests down.

Bug #1651248 reported by bugproxy on 2016-12-19
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
High
Tim Gardner
Yakkety
Undecided
Tim Gardner
Zesty
High
Tim Gardner

Bug Description

Hello Canonical.

This appears to be related to LP1636330

There are additional patches we would like picked up once they accepted upstream:

http://www.spinics.net/lists/kvm/msg141747.html

[PATCH 1/5] KVM: PPC: XICS cleanup: remove XICS_RM_REJECT
[PATCH 2/5] KVM: PPC: XICS: correct the real mode ICP rejecting counter
[PATCH 3/5] KVM: PPC: XICS: Fix potential resend twice issue
[PATCH 4/5] KVM: PPC: XICS: Implement ICS P/Q states
[PATCH 5/5] KVM: PPC: XICS: Don't lock twice when doing check resend

Bug history context:

== Comment: #1 - Application Cdeadmin - 2016-10-17 11:05:07 ==
==== State: Open by: nguyenp on 17 October 2016 09:54:13 ====

Ubuntu16.10-KVM host:
=====================
tul612p01.aus.stglabs.ibm.com

Problem Description:
===================
I set up a big configure with multiple guests running SRIOV on tul612fp1 system. For VFs, I created 64 of them on Ubuntu16.10-KVM host (32 VFs per port). I have 6 guests total (mix of ubuntu16.10 and ubuntu16.04.01 guests). I assigned 4 VFs for each guest.

I kicked off network tests on VFs on all guests late Friday night. The test ran good on Friday night then it all failed on sometime on Saturday night. Right now, all guests are currently are down. Another bad part is that the Ubuntu-KVM host is currently hung. So the whole system is very much down. I could not get to neither the KVM host nor any guest.

I have talked to my team leads, Siraj I. and Brian C. about the problem this morning. They want me to open the defect so that I/O developer to looking into the SRIOV problem on Ubuntu16.10-KVM again.

== Comment: #11 - Paul Nguyen - 2016-10-18 17:55:40 ==
Worked with Carol earlier today on the SRIOV configuration on this tul612 system. Thanks for your help, Carol !

I just want to document the differences of this configuration .vs. previous run configuration. So that we can keep track of what we have tried.

Here are the differences:

1) In this configuration, Carol installed patch driver on KVM host to avoid the BUG_ON.

2) To avoid missing interrupt issue, Carol told me to reduce the number of VFs to 16 VFs per port. (on previous run, I had 32 VFs per port which was 64 VFs total for 2 ports of the adapter)

3) Also Carol wanted to reduce number of guests down to 4 guests.
  (on previous run, I had 6 guests running and 4 VFs per guest)

4) Carol also configured NUMA for the 4 guests.

  <vcpu placement='static' cpuset='0,8,16,24'>16</vcpu>
  <numatune>
    <memory mode='strict' nodeset='0'/>
  </numatune>

All SRIOV tests have been running good for about 3 hours already. Let's see how long it hit the problem.

== Comment: #12 - Paul Nguyen - 2016-10-19 10:15:50 ==
I have 4 guests running SRIOV and 2 guests hit with TX timeout problem last night. The other 2 guests seem running fine.

tul612p05_unbuntu1610 guest:
============================
tul612p05.aus.stglabs.ibm.com

[Tue Oct 18 15:15:42 2016] nfsd: peername failed (err 107)!
[Tue Oct 18 16:04:06 2016] hrtimer: interrupt took 12062308 ns
[Wed Oct 19 00:28:52 2016] NETDEV WATCHDOG: enp0s4 (mlx5_core): transmit queue 2 timed out
[Wed Oct 19 00:28:52 2016] ------------[ cut here ]------------
[Wed Oct 19 00:28:52 2016] WARNING: CPU: 13 PID: 28503 at /build/linux-PXWOzW/linux-4.8.0/net/sched/sch_generic.c:316 dev_watchdog+0x380/0x390
[Wed Oct 19 00:28:52 2016] Modules linked in: iptable_filter rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) configfs ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx4_ib(OE) ib_sa(OE) ib_mad(OE) ib_core(OE) ib_addr(OE) ib_netlink(OE) mlx4_en(OE) mlx4_core(OE) vmx_crypto binfmt_misc nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables x_tables autofs4 ibmvscsi crc32c_vpmsum mlx5_core(OE) mlx_compat(OE)
[Wed Oct 19 00:28:52 2016] CPU: 13 PID: 28503 Comm: hxecpu Tainted: G OE 4.8.0-22-generic #24-Ubuntu
[Wed Oct 19 00:28:52 2016] task: c0000001db237000 task.stack: c0000001db294000
[Wed Oct 19 00:28:52 2016] NIP: c000000000a4e7a0 LR: c000000000a4e79c CTR: 0000000000000000
[Wed Oct 19 00:28:52 2016] REGS: c0000001db297820 TRAP: 0700 Tainted: G OE (4.8.0-22-generic)
[Wed Oct 19 00:28:52 2016] MSR: 8000000000029033 <SF,EE,ME,IR,DR,RI,LE> CR: 22000822 XER: 20000000
[Wed Oct 19 00:28:52 2016] CFAR: c000000000b9a6b4 SOFTE: 1
                           GPR00: c000000000a4e79c c0000001db297aa0 c0000000014f6600 000000000000003f
                           GPR04: c0000001ffe4aca0 c0000001ffe5fb40 c00000000169d278 0000000000005bd8
                           GPR08: 0000000000000007 c00000000102f774 00000001fee20000 0000000000000001
                           GPR12: 0000000000000000 c00000000fb87500 c0000001db294000 0000000000000000
                           GPR16: c000000001042700 00003fff79b04390 00000001008356c1 000000000000000a
                           GPR20: c000000001042700 c00000000102a908 c0000001db294000 c000000001522100
                           GPR24: 0000000000000000 ffffffffffffffff 0000000000000000 000000000000000d
                           GPR28: 0000000000000004 c0000001ec5a0000 c000000001522200 0000000000000002
[Wed Oct 19 00:28:52 2016] NIP [c000000000a4e7a0] dev_watchdog+0x380/0x390
[Wed Oct 19 00:28:52 2016] LR [c000000000a4e79c] dev_watchdog+0x37c/0x390
[Wed Oct 19 00:28:52 2016] Call Trace:
[Wed Oct 19 00:28:52 2016] [c0000001db297aa0] [c000000000a4e79c] dev_watchdog+0x37c/0x390 (unreliable)
[Wed Oct 19 00:28:52 2016] [c0000001db297b40] [c000000000165ae4] call_timer_fn+0x64/0x1d0
[Wed Oct 19 00:28:52 2016] [c0000001db297bd0] [c000000000165dd0] expire_timers+0x140/0x200
[Wed Oct 19 00:28:52 2016] [c0000001db297c40] [c000000000165f58] run_timer_softirq+0xc8/0x230
[Wed Oct 19 00:28:52 2016] [c0000001db297ce0] [c000000000b9865c] __do_softirq+0x18c/0x3fc
[Wed Oct 19 00:28:52 2016] [c0000001db297de0] [c0000000000d5848] irq_exit+0xc8/0x100
[Wed Oct 19 00:28:52 2016] [c0000001db297e00] [c000000000024810] timer_interrupt+0xa0/0xe0
[Wed Oct 19 00:28:52 2016] [c0000001db297e30] [c000000000002814] decrementer_common+0x114/0x180
[Wed Oct 19 00:28:52 2016] Instruction dump:
[Wed Oct 19 00:28:52 2016] 3d42fff5 7fa3eb78 992a60cb 4bfc8c0d 60000000 7fe6fb78 7fa4eb78 7c651b78
[Wed Oct 19 00:28:52 2016] 3c62ff99 38632c50 4814beb5 60000000 <0fe00000> 4bffffa0 60000000 60420000
[Wed Oct 19 00:28:52 2016] ---[ end trace 40487e0957f79fba ]---
[Wed Oct 19 00:28:52 2016] mlx5_core 0000:00:04.0 enp0s4: TX timeout detected
[Wed Oct 19 00:28:52 2016] mlx5_core 0000:00:04.0 enp0s4: TX timeout on queue: 2, SQ: 0xab, CQ: 0x17, SQ Cons: 0x3a23 SQ Prod: 0x3e04

tul612p06_unbuntu1610 guest:
============================
tul612p06.aus.stglabs.ibm.com

[Tue Oct 18 17:42:01 2016] hrtimer: interrupt took 16906112 ns
[Tue Oct 18 19:55:30 2016] INFO: rcu_sched self-detected stall on CPU
[Tue Oct 18 19:55:30 2016] INFO: rcu_sched detected stalls on CPUs/tasks:

[Tue Oct 18 19:55:30 2016] 1-...: (1 GPs behind) idle=4a1/140000000000002/0 softirq=1893544/1893549 fqs=2463
[Tue Oct 18 19:55:30 2016]
[Tue Oct 18 19:55:30 2016] (detected by 3, t=5252 jiffies, g=593501, c=593500, q=5696)
[Tue Oct 18 19:55:30 2016] Task dump for CPU 1:
[Tue Oct 18 19:55:30 2016] hxecom R
[Tue Oct 18 19:55:30 2016] running task
[Tue Oct 18 19:55:30 2016] 0 21673 21303 0x00040004
[Tue Oct 18 19:55:30 2016] Call Trace:
[Tue Oct 18 19:55:30 2016] [c0000001ec267af0] [c0000001ec267b30] 0xc0000001ec267b30
[Tue Oct 18 19:55:30 2016] (unreliable)

[Tue Oct 18 19:55:30 2016] 1-...: (1 GPs behind) idle=4a1/140000000000002/0 softirq=1893544/1893549 fqs=2472
[Tue Oct 18 19:55:30 2016] (t=5277 jiffies g=593501 c=593500 q=5708)
[Tue Oct 18 19:55:30 2016] Task dump for CPU 1:
[Tue Oct 18 19:55:30 2016] hxecom R running task 0 21673 21303 0x00040004
[Tue Oct 18 19:55:30 2016] Call Trace:
[Tue Oct 18 19:55:30 2016] [c0000001fffe3320] [c000000000111900] sched_show_task+0xe0/0x180 (unreliable)
[Tue Oct 18 19:55:30 2016] [c0000001fffe3390] [c000000000b9a98c] rcu_dump_cpu_stacks+0xec/0x120
[Tue Oct 18 19:55:30 2016] [c0000001fffe33e0] [c00000000015f1e0] rcu_check_callbacks+0x930/0xb30
[Tue Oct 18 19:55:30 2016] [c0000001fffe3510] [c000000000167688] update_process_times+0x48/0x90
[Tue Oct 18 19:55:30 2016] [c0000001fffe3540] [c00000000017ef58] tick_sched_handle.isra.4+0x48/0xe0
[Tue Oct 18 19:55:30 2016] [c0000001fffe3580] [c00000000017f054] tick_sched_timer+0x64/0xd0
[Tue Oct 18 19:55:30 2016] [c0000001fffe35c0] [c0000000001686b4] __hrtimer_run_queues+0x124/0x420
[Tue Oct 18 19:55:30 2016] [c0000001fffe3650] [c00000000016933c] hrtimer_interrupt+0xec/0x2b0
[Tue Oct 18 19:55:30 2016] [c0000001fffe3710] [c0000000000243ec] __timer_interrupt+0x8c/0x270
[Tue Oct 18 19:55:30 2016] [c0000001fffe3760] [c00000000002480c] timer_interrupt+0x9c/0xe0
[Tue Oct 18 19:55:30 2016] [c0000001fffe3790] [c000000000002814] decrementer_common+0x114/0x180
[Tue Oct 18 19:55:30 2016] --- interrupt: 901 at plpar_hcall_norets+0x1c/0x28
                               LR = tce_freemulti_pSeriesLP+0x60/0x130
[Tue Oct 18 19:55:30 2016] [c0000001fffe3a80] [c00000000010fdb8] scheduler_tick+0x148/0x180 (unreliable)
[Tue Oct 18 19:55:30 2016] [c0000001fffe3b00] [c000000000046618] __iommu_free+0x108/0x220
[Tue Oct 18 19:55:30 2016] [c0000001fffe3b90] [c0000000000476cc] ppc_iommu_unmap_sg+0xcc/0x140
[Tue Oct 18 19:55:30 2016] [c0000001fffe3be0] [c000000000045acc] dma_iommu_unmap_sg+0x4c/0x70
[Tue Oct 18 19:55:30 2016] [c0000001fffe3c30] [c0000000007e2820] scsi_dma_unmap+0x80/0xb0
[Tue Oct 18 19:55:30 2016] [c0000001fffe3c60] [d000000001f8156c] unmap_cmd_data.isra.5.part.6+0x34/0x60 [ibmvscsi]
[Tue Oct 18 19:55:30 2016] [c0000001fffe3c90] [d000000001f83b78] handle_cmd_rsp+0x100/0x170 [ibmvscsi]
[Tue Oct 18 19:55:30 2016] [c0000001fffe3d10] [d000000001f82e0c] ibmvscsi_handle_crq+0x2e4/0x3f0 [ibmvscsi]
[Tue Oct 18 19:55:30 2016] [c0000001fffe3d90] [d000000001f82f74] ibmvscsi_task+0x5c/0xe0 [ibmvscsi]
[Tue Oct 18 19:55:30 2016] [c0000001fffe3de0] [c0000000000d60d8] tasklet_action+0xd8/0x1c0
[Tue Oct 18 19:55:30 2016] [c0000001fffe3e40] [c000000000b9865c] __do_softirq+0x18c/0x3fc
[Tue Oct 18 19:55:30 2016] [c0000001fffe3f40] [c0000000000d5848] irq_exit+0xc8/0x100
[Tue Oct 18 19:55:30 2016] [c0000001fffe3f60] [c00000000001593c] __do_irq+0x8c/0x1d0
[Tue Oct 18 19:55:30 2016] [c0000001fffe3f90] [c000000000029780] call_do_irq+0x14/0x24
[Tue Oct 18 19:55:30 2016] [c0000001ec267de0] [c000000000015b18] do_IRQ+0x98/0x140
[Tue Oct 18 19:55:30 2016] [c0000001ec267e30] [c000000000002694] hardware_interrupt_common+0x114/0x180
[Tue Oct 18 19:55:32 2016] NETDEV WATCHDOG: enp0s4 (mlx5_core): transmit queue 2 timed out
[Tue Oct 18 19:55:32 2016] ------------[ cut here ]------------
[Tue Oct 18 19:55:32 2016] WARNING: CPU: 13 PID: 16322 at /build/linux-PXWOzW/linux-4.8.0/net/sched/sch_generic.c:316 dev_watchdog+0x380/0x390
[Tue Oct 18 19:55:32 2016] Modules linked in: iptable_filter rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) configfs ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx4_ib(OE) ib_sa(OE) ib_mad(OE) ib_core(OE) ib_addr(OE) ib_netlink(OE) mlx4_en(OE) mlx4_core(OE) binfmt_misc vmx_crypto nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables x_tables autofs4 ibmvscsi mlx5_core(OE) crc32c_vpmsum mlx_compat(OE)
[Tue Oct 18 19:55:32 2016] CPU: 13 PID: 16322 Comm: hxecpu Tainted: G OE 4.8.0-22-generic #24-Ubuntu
[Tue Oct 18 19:55:32 2016] task: c0000001e0384200 task.stack: c0000001e0360000
[Tue Oct 18 19:55:32 2016] NIP: c000000000a4e7a0 LR: c000000000a4e79c CTR: 0000000000000000
[Tue Oct 18 19:55:32 2016] REGS: c0000001e0363820 TRAP: 0700 Tainted: G OE (4.8.0-22-generic)
[Tue Oct 18 19:55:32 2016] MSR: 8000000000029033 <SF,EE,ME,IR,DR,RI,LE> CR: 22004822 XER: 20000000
[Tue Oct 18 19:55:32 2016] CFAR: c000000000b9a6b4 SOFTE: 1
                           GPR00: c000000000a4e79c c0000001e0363aa0 c0000000014f6600 000000000000003f
                           GPR04: c0000001ffe4aca0 c0000001ffe5fb40 c00000000169d278 0000000000006cf8
                           GPR08: 0000000000000007 c00000000102f774 00000001fee20000 0000000000000001
                           GPR12: 0000000000004400 c00000000fb87500 c0000001e0360000 0000000000000000
                           GPR16: c000000001042700 00003fff9acf4390 0000000100435141 000000000000000a
                           GPR20: c000000001042700 c00000000102a908 c0000001e0360000 c000000001522100
                           GPR24: 0000000000000000 ffffffffffffffff 0000000000000000 000000000000000d
                           GPR28: 0000000000000004 c0000001ed1a0000 c000000001522200 0000000000000002
[Tue Oct 18 19:55:32 2016] NIP [c000000000a4e7a0] dev_watchdog+0x380/0x390
[Tue Oct 18 19:55:32 2016] LR [c000000000a4e79c] dev_watchdog+0x37c/0x390
[Tue Oct 18 19:55:32 2016] Call Trace:
[Tue Oct 18 19:55:32 2016] [c0000001e0363aa0] [c000000000a4e79c] dev_watchdog+0x37c/0x390 (unreliable)
[Tue Oct 18 19:55:32 2016] [c0000001e0363b40] [c000000000165ae4] call_timer_fn+0x64/0x1d0
[Tue Oct 18 19:55:32 2016] [c0000001e0363bd0] [c000000000165dd0] expire_timers+0x140/0x200
[Tue Oct 18 19:55:32 2016] [c0000001e0363c40] [c000000000165f58] run_timer_softirq+0xc8/0x230
[Tue Oct 18 19:55:32 2016] [c0000001e0363ce0] [c000000000b9865c] __do_softirq+0x18c/0x3fc
[Tue Oct 18 19:55:32 2016] [c0000001e0363de0] [c0000000000d5848] irq_exit+0xc8/0x100
[Tue Oct 18 19:55:32 2016] [c0000001e0363e00] [c000000000024810] timer_interrupt+0xa0/0xe0
[Tue Oct 18 19:55:32 2016] [c0000001e0363e30] [c000000000002814] decrementer_common+0x114/0x180
[Tue Oct 18 19:55:32 2016] Instruction dump:
[Tue Oct 18 19:55:32 2016] 3d42fff5 7fa3eb78 992a60cb 4bfc8c0d 60000000 7fe6fb78 7fa4eb78 7c651b78
[Tue Oct 18 19:55:32 2016] 3c62ff99 38632c50 4814beb5 60000000 <0fe00000> 4bffffa0 60000000 60420000
[Tue Oct 18 19:55:32 2016] ---[ end trace 1f642883451e0bbd ]---
[Tue Oct 18 19:55:32 2016] mlx5_core 0000:00:04.0 enp0s4: TX timeout detected
[Tue Oct 18 19:55:32 2016] mlx5_core 0000:00:04.0 enp0s4: TX timeout on queue: 2, SQ: 0xab, CQ: 0x17, SQ Cons: 0x120e SQ Prod: 0x15ef
[htx@tul612p06] [1m/usr/lpp/htx/bin# [0m

== Comment: #13 - Carol L. Soto - 2016-10-19 11:20:01 ==
I grabbed dmesg from KVM and the 2 guests. Dmesg shows the timeout but I think the interrupt are ok because htxerr is clean in both guests and netstat -in shows all 4 VFs running.
I will add the dmesg to this bugzilla and to the other bugzilla where we are debugging this for Paul.
I think u can continue running to see if u hit something else.

== Comment: #22 - Carol L. Soto - 2016-10-24 18:00:51 ==

>
> Per comment #19:
> - I rebooted the tul612p01 Ubuntu1610-KVM host. You could not log into the
> host because tt was hung.
>
> - The tul612p01 KVM host is rebooted and up. You can log in to it. Let me
> know what you need me to do.

I looked at the logs and I think this time it ran from oct18 thru Oct 21 12:55:29 when it hit the FW issue at the card.
Guest tul612p02 - I am not sure what happened could not find the logs

Guest tul612p03 - it only hit the FW issue so this one did not hit the command timeout

Guest tul612p05 -
Oct 19 00:28:52 - enp0s4 hit the transmit timeout but I think driver recovered and continue running.
Oct 21 07:19:28 - device 0000:00:04.0 hit the command timeout
Oct 21 12:55:30 - is when everybody hit the FW issue.

Guest tul612p06
Oct 18 19:55:32 - enp0s4 hit transmit timeout but driver recovered and continue running.
Oct 21 12:54:22 - device 0000:00:06.0 command timeout
Oct 21 12:55:31 - is when everybody hit the FW issue

From htx point of view it will stop when the FW dies, the command timeout will not cause HTX to stop.

I changed 2 virtual guests. I brought up 2 guests with 8 VFs. I pinned the cpus a little different this time I would like to see how long it runs this way. I saved the xmls that we used before in /etc/libvirt/qemu/*.oct18-21. I also updated the kernel in those 2 systems to latest kernel. the guests are
virsh list
 Id Name State
----------------------------------------------------
 9 tul612p05_ubuntu1610 running
 10 tul612p06_ubuntu1610 running

Since I added VFs to the guests you will need to redo the htx setup.

CVE References

Default Comment by Bridge

tags: added: architecture-ppc64le bugnameltc-147640 severity-critical targetmilestone-inin1610

Default Comment by Bridge

Default Comment by Bridge

Default Comment by Bridge

Changed in ubuntu:
assignee: nobody → Taco Screen team (taco-screen-team)
affects: ubuntu → linux (Ubuntu)

------- Comment From <email address hidden> 2016-12-19 16:55 EDT-------
cde00 (<email address hidden>) added native attachment /tmp/AIXOS06645505/traces.out on 2016-12-19 15:33:09

== Comment: #1 - Application Cdeadmin - 2016-10-17 11:05:07 ====== State: Assigned by: cde00 on 19 December 2016 15:48:02 ====

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-12-19 17:03 EDT-------
==== State: Assigned by: ruddk on 19 December 2016 15:58:06 ====

#=#=# 2016-12-19 15:57:42 (CST) #=#=#
#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#

Changed in linux (Ubuntu):
assignee: Taco Screen team (taco-screen-team) → Canonical Kernel Team (canonical-kernel-team)
importance: Undecided → High
status: New → Triaged
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2017-01-11 15:09 EDT-------
Paul - since this is tracking the interrupt race condition fixes coming from your team, reassigning.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2017-02-07 12:08 EDT-------
*** Bug 151284 has been marked as a duplicate of this bug. ***

bugproxy (bugproxy) on 2017-02-08
tags: added: targetmilestone-inin1704
removed: targetmilestone-inin1610
Tim Gardner (timg-tpi) wrote :
Changed in linux (Ubuntu Yakkety):
assignee: nobody → Tim Gardner (timg-tpi)
status: New → In Progress
Changed in linux (Ubuntu Zesty):
assignee: Canonical Kernel Team (canonical-kernel-team) → Tim Gardner (timg-tpi)
status: Triaged → In Progress
Tim Gardner (timg-tpi) on 2017-02-08
Changed in linux (Ubuntu Zesty):
status: In Progress → Fix Committed
Tim Gardner (timg-tpi) on 2017-02-16
Changed in linux (Ubuntu Yakkety):
status: In Progress → Fix Committed

Default Comment by Bridge

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.10.0-8.10

---------------
linux (4.10.0-8.10) zesty; urgency=low

  [ Tim Gardner ]

  * Release Tracking Bug
    - LP: #1664217

  * [Hyper-V] Bug fixes for storvsc (tagged queuing, error conditions)
    (LP: #1663687)
    - scsi: storvsc: Enable tracking of queue depth
    - scsi: storvsc: Remove the restriction on max segment size
    - scsi: storvsc: Enable multi-queue support
    - scsi: storvsc: use tagged SRB requests if supported by the device
    - scsi: storvsc: properly handle SRB_ERROR when sense message is present
    - scsi: storvsc: properly set residual data length on errors

  * Ubuntu16.10-KVM:Big configuration with multiple guests running SRIOV VFs
    caused KVM host hung and all KVM guests down. (LP: #1651248)
    - KVM: PPC: Book 3S: XICS cleanup: remove XICS_RM_REJECT
    - KVM: PPC: Book 3S: XICS: correct the real mode ICP rejecting counter
    - KVM: PPC: Book 3S: XICS: Fix potential issue with duplicate IRQ resends
    - KVM: PPC: Book 3S: XICS: Implement ICS P/Q states
    - KVM: PPC: Book 3S: XICS: Don't lock twice when checking for resend

  * overlay: mkdir fails if directory exists in lowerdir in a user namespace
    (LP: #1531747)
    - SAUCE: overlayfs: Skip permission checking for trusted.overlayfs.* xattrs

  * CVE-2016-1575 (LP: #1534961)
    - SAUCE: overlayfs: Skip permission checking for trusted.overlayfs.* xattrs

  * CVE-2016-1576 (LP: #1535150)
    - SAUCE: overlayfs: Skip permission checking for trusted.overlayfs.* xattrs

  * Miscellaneous Ubuntu changes
    - SAUCE: md/raid6 algorithms: scale test duration for speedier boots
    - SAUCE: Import aufs driver
    - d-i: Build message-modules udeb for arm64
    - rebase to v4.10-rc8

  * Miscellaneous upstream changes
    - Revert "UBUNTU: SAUCE: aufs -- remove .readlink assignment"
    - Revert "UBUNTU: SAUCE: (no-up) aufs: for v4.9-rc1, support setattr_prepare()"
    - Revert "UBUNTU: SAUCE: aufs -- Add flags argument to aufs_rename()"
    - Revert "UBUNTU: SAUCE: aufs -- Convert to use xattr handlers"
    - Revert "UBUNTU: SAUCE: Import aufs driver"

  [ Upstream Kernel Changes ]

  * rebase to v4.10-rc8

 -- Tim Gardner <email address hidden> Mon, 06 Feb 2017 08:34:24 -0700

Changed in linux (Ubuntu Zesty):
status: Fix Committed → Fix Released
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-yakkety' to 'verification-done-yakkety'. If the problem still exists, change the tag 'verification-needed-yakkety' to 'verification-failed-yakkety'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-yakkety
Launchpad Janitor (janitor) wrote :
Download full text (6.0 KiB)

This bug was fixed in the package linux - 4.8.0-40.43

---------------
linux (4.8.0-40.43) yakkety; urgency=low

  * linux: 4.8.0-40.43 -proposed tracker (LP: #1667066)

  [ Andy Whitcroft ]
  * NFS client : permission denied when trying to access subshare, since kernel
    4.4.0-31 (LP: #1649292)
    - fs: Better permission checking for submounts

  * shaking screen (LP: #1651981)
    - drm/radeon: drop verde dpm quirks

  * [0bda:0328] Card reader failed after S3 (LP: #1664809)
    - usb: hub: Wait for connection to be reestablished after port reset

  * linux-lts-xenial 4.4.0-63.84~14.04.2 ADT test failure with linux-lts-xenial
    4.4.0-63.84~14.04.2 (LP: #1664912)
    - SAUCE: apparmor: fix link auditing failure due to, uninitialized var

  * In Ubuntu 17.04 : after reboot getting message in console like Unable to
    open file: /etc/keys/x509_ima.der (-2) (LP: #1656908)
    - SAUCE: ima: Downgrade error to warning

  * 16.04.2: Extra patches for POWER9 (LP: #1664564)
    - powerpc/mm: Fix no execute fault handling on pre-POWER5
    - powerpc/mm: Fix spurrious segfaults on radix with autonuma

  * ibmvscsis: Add SGL LIMIT (LP: #1662551)
    - ibmvscsis: Add SGL limit

  * [Hyper-V] Bug fixes for storvsc (tagged queuing, error conditions)
    (LP: #1663687)
    - scsi: storvsc: Enable tracking of queue depth
    - scsi: storvsc: Remove the restriction on max segment size
    - scsi: storvsc: Enable multi-queue support
    - scsi: storvsc: use tagged SRB requests if supported by the device
    - scsi: storvsc: properly handle SRB_ERROR when sense message is present
    - scsi: storvsc: properly set residual data length on errors

  * Ubuntu16.10-KVM:Big configuration with multiple guests running SRIOV VFs
    caused KVM host hung and all KVM guests down. (LP: #1651248)
    - KVM: PPC: Book 3S: XICS cleanup: remove XICS_RM_REJECT
    - KVM: PPC: Book 3S: XICS: correct the real mode ICP rejecting counter
    - KVM: PPC: Book 3S: XICS: Fix potential issue with duplicate IRQ resends
    - KVM: PPC: Book 3S: XICS: Implement ICS P/Q states
    - KVM: PPC: Book 3S: XICS: Don't lock twice when checking for resend

  * ISST-LTE:pNV: ppc64_cpu command is hung w HDs, SSDs and NVMe (LP: #1662666)
    - blk-mq: Avoid memory reclaim when remapping queues
    - blk-mq: Fix failed allocation path when mapping queues
    - blk-mq: Always schedule hctx->next_cpu

  * systemd-udevd hung in blk_mq_freeze_queue_wait testing unpartitioned NVMe
    drive (LP: #1662673)
    - percpu-refcount: fix reference leak during percpu-atomic transition

  * [Yakkety SRU] Enable KEXEC support in ARM64 kernel (LP: #1662554)
    - [Config] Enable KEXEC support in ARM64.

  * [Hyper-V] Fix ring buffer handling to avoid host throttling (LP: #1661430)
    - Drivers: hv: vmbus: On write cleanup the logic to interrupt the host
    - Drivers: hv: vmbus: On the read path cleanup the logic to interrupt the host
    - Drivers: hv: vmbus: finally fix hv_need_to_signal_on_read()

  * brd module compiled as built-in (LP: #1593293)
    - CONFIG_BLK_DEV_RAM=m

  * regession tests failing after stackprofile test is run (LP: #1661030)
    - SAUCE: fix regression with domain change in compla...

Read more...

Changed in linux (Ubuntu Yakkety):
status: Fix Committed → Fix Released
bugproxy (bugproxy) on 2017-03-06
tags: added: verification-done-yakkety
removed: verification-needed-yakkety

------- Comment From <email address hidden> 2017-03-07 09:03 EDT-------
Making Carol's comment public

-------------------------------

I ran for 5 days with 4.8.0-40-generic and 2 guests with 8 VFs each guest and it ran ok.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers