[Hyper-V] Crash in hot-add/remove scsi devices (smp)

Bug #1509029 reported by Joshua R. Poulson on 2015-10-22
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Medium
Joseph Salisbury
Trusty
Medium
Joseph Salisbury
Vivid
Medium
Joseph Salisbury
Wily
Medium
Joseph Salisbury
Xenial
Medium
Joseph Salisbury

Bug Description

On some host errors storvsc module tries to remove sdev by scheduling a job
which does the following:

   sdev = scsi_device_lookup(wrk->host, 0, 0, wrk->lun);
   if (sdev) {
       scsi_remove_device(sdev);
       scsi_device_put(sdev);
   }

While this code seems correct the following crash is observed:

 general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
 RIP: 0010:[<ffffffff81169979>] [<ffffffff81169979>] bdi_destroy+0x39/0x220
 ...
 [<ffffffff814aecdc>] ? _raw_spin_unlock_irq+0x2c/0x40
 [<ffffffff8127b7db>] blk_cleanup_queue+0x17b/0x270
 [<ffffffffa00b54c4>] __scsi_remove_device+0x54/0xd0 [scsi_mod]
 [<ffffffffa00b556b>] scsi_remove_device+0x2b/0x40 [scsi_mod]
 [<ffffffffa00ec47d>] storvsc_remove_lun+0x3d/0x60 [hv_storvsc]
 [<ffffffff81080791>] process_one_work+0x1b1/0x530
 ...

The problem comes with the fact that many such jobs (for the same device)
are being scheduled simultaneously. While scsi_remove_device() uses
shost->scan_mutex and scsi_device_lookup() will fail for a device in
SDEV_DEL state there is no protection against someone who did
scsi_device_lookup() before we actually entered __scsi_remove_device(). So
the whole scenario looks like that: two callers do simultaneous (or
preemption happens) calls to scsi_device_lookup() ant these calls succeed
for all of them, after that both callers try doing scsi_remove_device().
shost->scan_mutex only serializes their calls to __scsi_remove_device()
and we end up doing the cleanup path twice.

Signed-off-by: Vitaly Kuznetsov <email address hidden>
---
 drivers/scsi/scsi_sysfs.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
index b333389..e0d2707 100644
--- a/drivers/scsi/scsi_sysfs.c
+++ b/drivers/scsi/scsi_sysfs.c
@@ -1076,6 +1076,14 @@ void __scsi_remove_device(struct scsi_device *sdev)
 {
        struct device *dev = &sdev->sdev_gendev;

+ /*
+ * This cleanup path is not reentrant and while it is impossible
+ * to get a new reference with scsi_device_get() someone can still
+ * hold a previously acquired one.
+ */
+ if (sdev->sdev_state == SDEV_DEL)
+ return;
+
        if (sdev->is_visible) {
                if (scsi_device_set_state(sdev, SDEV_CANCEL) != 0)
                        return;

--
2.4.3

tags: added: kernel-da-key kernel-hyper-v
Joshua R. Poulson (jrp) on 2015-10-22
Changed in linux (Ubuntu):
status: New → Confirmed
Changed in linux (Ubuntu):
importance: Undecided → Medium
Joseph Salisbury (jsalisbury) wrote :

I don't see this patch in mainline as of yet. Is it in any upstream repos? Do you want me to build a test kernel with this patch?

Changed in linux (Ubuntu):
assignee: nobody → Joseph Salisbury (jsalisbury)
Joshua R. Poulson (jrp) wrote :

This is a sauce request. It has been submitted upstream but not yet accepted.

Joseph Salisbury (jsalisbury) wrote :

I build a Wily test kernel with the patch mentioned in the description. The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1509029/wily/

Can you test this kernel and see if it resolves this bug. Note, you need to install both the linux-image and linux-image-extra .deb packages.

Also, is this patch also needed in Vivid and Trusty?

Changed in linux (Ubuntu Wily):
importance: Undecided → Medium
status: New → In Progress
Changed in linux (Ubuntu):
status: Confirmed → In Progress
Changed in linux (Ubuntu Wily):
assignee: nobody → Joseph Salisbury (jsalisbury)
Chris Valean (cvalean) wrote :

Hi Joe,
We've tested the Wily test kernel provided and it's looking good, please merge the changes.

As for the previous releases - Trusty, Vivid - please include those as well if possible, as this patch is not conflicting with the other LIS diff patches list of inclusion.

Changed in linux (Ubuntu Vivid):
status: New → In Progress
Changed in linux (Ubuntu Trusty):
status: New → In Progress
importance: Undecided → Medium
Changed in linux (Ubuntu Vivid):
importance: Undecided → Medium
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Trusty):
assignee: nobody → Joseph Salisbury (jsalisbury)
Joseph Salisbury (jsalisbury) wrote :

Thanks for the test results, Chris. I also built Trusty and Vivid test kernels. The can be downloaded from:
Trusty: http://kernel.ubuntu.com/~jsalisbury/lp1509029/trusty
Vivid: ttp://kernel.ubuntu.com/~jsalisbury/lp1509029/vivid

Can you give these kernels a test as well. I'll submit an SRU request for them as well if the testing is good.

Thanks again.

Adrian Suhov (asuhov) wrote :

Hi!
I've tested both test kernels and all seems good, syslog and dmesg were fine. Please merge the changes.

Tim Gardner (timg-tpi) on 2015-11-17
Changed in linux (Ubuntu Trusty):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Vivid):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Wily):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Xenial):
status: In Progress → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.3.0-1.10

---------------
linux (4.3.0-1.10) xenial; urgency=low

  [ Andy Whitcroft ]

  * [Config] make IBMVETH consistent on powerpc/ppc64el
    - LP: #1521712
  * [Config] follow ibmvscsi name change
    - LP: #1521712
  * [Config] move ibm disk and ethernet drivers to linux-image
    - LP: #1521712
  * [Config] include ibmveth in nic-modules for ppc64el
    - LP: #1521712
  * [Config] s390x -- disable abi/module checks for s390x

  [ Tim Gardner ]

  * [Config] Add spl/zfs provides to generic and powerpc64-smp
  * [Config] Add zfs to d-i fs-core-modules

  [ Upstream Kernel Changes ]

  * KVM: x86: work around infinite loop in microcode when #AC is delivered
  * KVM: svm: unconditionally intercept #DB
  * Btrfs: fix truncation of compressed and inlined extents
  * staging/dgnc: fix info leak in ioctl
  * [media] media/vivid-osd: fix info leak in ioctl
  * crypto: asymmetric_keys - remove always false comparison
  * X.509: Fix the time validation [ver #2]
  * isdn_ppp: Add checks for allocation failure in isdn_ppp_open()
  * ppp, slip: Validate VJ compression slot parameters completely

 -- Andy Whitcroft <email address hidden> Tue, 01 Dec 2015 21:37:13 +0000

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-trusty' to 'verification-done-trusty'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-trusty
tags: added: verification-needed-vivid
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-vivid' to 'verification-done-vivid'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-wily
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-wily' to 'verification-done-wily'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

Joshua R. Poulson (jrp) wrote :

We'll test the kernel in -proposed right away. Thanks!

Joseph Salisbury (jsalisbury) wrote :

Have you had a chance to test -proposed as of yet so we can mark the bug as 'verification-done-wily'?

Joshua R. Poulson (jrp) wrote :

In progress. Sorry, there were a lot of -proposed requests this week.

Zsolt Dudás (v-zsduda) on 2015-12-11
tags: added: verification-done-vivid
removed: verification-needed-vivid
Zsolt Dudás (v-zsduda) on 2015-12-11
tags: added: verification-done-wily
removed: verification-needed-wily
Zsolt Dudás (v-zsduda) wrote :

Tested the -proposed kernels on all 3 releases, couldn't reproduce the bug.

tags: added: verification-done-trusty
removed: verification-needed-trusty
Launchpad Janitor (janitor) wrote :
Download full text (8.8 KiB)

This bug was fixed in the package linux - 3.13.0-73.116

---------------
linux (3.13.0-73.116) trusty; urgency=low

  [ Luis Henriques ]

  * Release Tracking Bug
    - LP: #1522858

  [ Upstream Kernel Changes ]

  * Revert "dm: fix AB-BA deadlock in __dm_destroy()"
    - LP: #1522766
  * dm: fix AB-BA deadlock in __dm_destroy()
    - LP: #1522766

linux (3.13.0-72.115) trusty; urgency=low

  [ Luis Henriques ]

  * Release Tracking Bug
    - LP: #1521979

  [ Andy Whitcroft ]

  * [Packaging] control -- make element ordering deterministic
    - LP: #1516686
  * [Packaging] control -- prepare for new kernel-wedge semantics
    - LP: #1516686
  * [Tests] rebuild -- fix up rebuild test
    - LP: #1516686
  * [Debian] rebuild should only trigger for non-linux packages
    - LP: #1498862, #1516686
  * [Tests] gcc-multilib does not exist on ppc64el
    - LP: #1515541

  [ Craig Magina ]

  * [Config] Enable USB for arm64
    - LP: #1514971

  [ Duc Dang ]

  * SAUCE: (noup) arm64: dts: Add USB nodes for APM X-Gene v1 platforms
    - LP: #1514971

  [ Joseph Salisbury ]

  * SAUCE: scsi_sysfs: protect against double execution of
    __scsi_remove_device()
    - LP: #1509029

  [ Upstream Kernel Changes ]

  * Revert "ARM64: unwind: Fix PC calculation"
    - LP: #1520264
  * [SCSI] hpsa: allow SCSI mid layer to handle unit attention
    - LP: #1512415
  * usb: make xhci platform driver use 64 bit or 32 bit DMA
    - LP: #1514971
  * usb: Add support for ACPI identification to xhci-platform
    - LP: #1514971
  * xhci: Workaround to get Intel xHCI reset working more reliably
  * isdn_ppp: Add checks for allocation failure in isdn_ppp_open()
    - LP: #1520264
  * ppp, slip: Validate VJ compression slot parameters completely
    - LP: #1520264
  * staging/dgnc: fix info leak in ioctl
    - LP: #1520264
  * regmap: debugfs: Ensure we don't underflow when printing access masks
    - LP: #1520264
  * regmap: debugfs: Don't bother actually printing when calculating max
    length
    - LP: #1520264
  * tools lib traceevent: Fix string handling in heterogeneous arch
    environments
    - LP: #1520264
  * perf tools: Fix copying of /proc/kcore
    - LP: #1520264
  * ASoC: db1200: Fix DAI link format for db1300 and db1550
    - LP: #1520264
  * m68k: Define asmlinkage_protect
    - LP: #1520264
  * x86/xen: Support kexec/kdump in HVM guests by doing a soft reset
    - LP: #1520264
  * x86/xen: Do not clip xen_e820_map to xen_e820_map_entries when
    sanitizing map
    - LP: #1520264
  * UBI: return ENOSPC if no enough space available
    - LP: #1520264
  * s390/boot: fix boot of compressed kernel built with gcc 4.9
    - LP: #1520264
  * s390/boot/decompression: disable floating point in decompressor
    - LP: #1520264
  * MIPS: dma-default: Fix 32-bit fall back to GFP_DMA
    - LP: #1520264
  * drm/qxl: recreate the primary surface when the bo is not primary
    - LP: #1520264
  * genirq: Fix race in register_irq_proc()
    - LP: #1520264
  * KVM: nSVM: Check for NRIPS support before updating control field
    - LP: #1520264
  * Use WARN_ON_ONCE for missing X86_FEATURE_NRIPS
    - LP: #1520264
  * dm: fix AB-BA deadlock in __dm_destroy()
    - LP: #15202...

Read more...

Changed in linux (Ubuntu Trusty):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.2.0-21.25

---------------
linux (4.2.0-21.25) wily; urgency=low

  [ Luis Henriques ]

  * Release Tracking Bug
    - LP: #1522108

  [ Upstream Kernel Changes ]

  * staging/dgnc: fix info leak in ioctl
    - LP: #1509565
    - CVE-2015-7885
  * [media] media/vivid-osd: fix info leak in ioctl
    - LP: #1509564
    - CVE-2015-7884
  * KEYS: Fix race between key destruction and finding a keyring by name
    - LP: #1508856
    - CVE-2015-7872
  * KEYS: Fix crash when attempt to garbage collect an uninstantiated
    keyring
    - LP: #1508856
    - CVE-2015-7872
  * KEYS: Don't permit request_key() to construct a new keyring
    - LP: #1508856
    - CVE-2015-7872
  * isdn_ppp: Add checks for allocation failure in isdn_ppp_open()
    - LP: #1508329
    - CVE-2015-7799
  * ppp, slip: Validate VJ compression slot parameters completely
    - LP: #1508329
    - CVE-2015-7799

linux (4.2.0-20.24) wily; urgency=low

  [ Brad Figg ]

  * Release Tracking Bug
    - LP: #1521753

  [ Andy Whitcroft ]

  * [Tests] gcc-multilib does not exist on ppc64el
    - LP: #1515541

  [ Joseph Salisbury ]

  * SAUCE: scsi_sysfs: protect against double execution of
    __scsi_remove_device()
    - LP: #1509029

  [ Manoj Kumar ]

  * SAUCE: (noup) cxlflash: Fix to escalate LINK_RESET also on port 1
    - LP: #1513583

  [ Matthew R. Ochs ]

  * SAUCE: (noup) cxlflash: Fix to avoid virtual LUN failover failure
    - LP: #1513583

  [ Oren Givon ]

  * SAUCE: (noup) iwlwifi: Add new PCI IDs for the 8260 series
    - LP: #1517375

  [ Seth Forshee ]

  * [Config] CONFIG_DRM_AMDGPU_CIK=n
    - LP: #1510405

  [ Upstream Kernel Changes ]

  * net/mlx5e: Disable VLAN filter in promiscuous mode
    - LP: #1514861
  * drivers: net: xgene: fix RGMII 10/100Mb mode
    - LP: #1433290
  * HID: rmi: Disable scanning if the device is not a wake source
    - LP: #1515503
  * HID: rmi: Set F01 interrupt enable register when not set
    - LP: #1515503
  * net/mlx5e: Ethtool link speed setting fixes
    - LP: #1517919
  * scsi_scan: don't dump trace when scsi_prep_async_scan() is called twice
    - LP: #1517942
  * x86/ioapic: Disable interrupts when re-routing legacy IRQs
    - LP: #1508593
  * xhci: Workaround to get Intel xHCI reset working more reliably
  * megaraid_sas: Do not use PAGE_SIZE for max_sectors
    - LP: #1475166
  * net: usb: cdc_ether: add Dell DW5580 as a mobile broadband adapter
    - LP: #1513847
  * KVM: svm: unconditionally intercept #DB
    - LP: #1520184
    - CVE-2015-8104

 -- Luis Henriques <email address hidden> Wed, 02 Dec 2015 17:30:58 +0000

Changed in linux (Ubuntu Wily):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :
Download full text (14.4 KiB)

This bug was fixed in the package linux - 3.19.0-41.46

---------------
linux (3.19.0-41.46) vivid; urgency=low

  [ Luis Henriques ]

  * Release Tracking Bug
    - LP: #1522918

  [ Upstream Kernel Changes ]

  * Revert "dm: fix AB-BA deadlock in __dm_destroy()"
    - LP: #1522766
  * dm: fix AB-BA deadlock in __dm_destroy()
    - LP: #1522766

linux (3.19.0-40.45) vivid; urgency=low

  [ Luis Henriques ]

  * Release Tracking Bug
    - LP: #1522786

  [ Andy Whitcroft ]

  * [Packaging] control -- prepare for new kernel-wedge semantics
    - LP: #1516686
  * [Debian] rebuild should only trigger for non-linux packages
    - LP: #1498862, #1516686
  * [Tests] gcc-multilib does not exist on ppc64el
    - LP: #1515541

  [ Joseph Salisbury ]

  * SAUCE: scsi_sysfs: protect against double execution of
    __scsi_remove_device()
    - LP: #1509029

  [ Luis Henriques ]

  * [Config] updateconfigs after 3.19.8-ckt10 stable update

  [ Upstream Kernel Changes ]

  * Revert "ARM64: unwind: Fix PC calculation"
    - LP: #1520309
  * Revert "md: allow a partially recovered device to be hot-added to an
    array."
    - LP: #1520309
  * tty: fix stall caused by missing memory barrier in drivers/tty/n_tty.c
    - LP: #1512815
  * HID: rmi: Print the firmware id of the touchpad
    - LP: #1515503
  * HID: rmi: Add functions for writing to registers
    - LP: #1515503
  * HID: rmi: Disable scanning if the device is not a wake source
    - LP: #1515503
  * HID: rmi: Set F01 interrupt enable register when not set
    - LP: #1515503
  * be2net: log link status
    - LP: #1513980
  * xhci: Workaround to get Intel xHCI reset working more reliably
  * Drivers: hv: hv_balloon: refuse to balloon below the floor
    - LP: #1294283
  * Drivers: hv: hv_balloon: survive ballooning request with num_pages=0
    - LP: #1294283
  * Drivers: hv: hv_balloon: correctly handle val.freeram<num_pages case
    - LP: #1294283
  * Drivers: hv: hv_balloon: correctly handle num_pages>INT_MAX case
    - LP: #1294283
  * Drivers: hv: balloon: check if ha_region_mutex was acquired in
    MEM_CANCEL_ONLINE case
    - LP: #1294283
  * mm: meminit: make __early_pfn_to_nid SMP-safe and introduce
    meminit_pfn_in_nid
    - LP: #1294283
  * mm: meminit: inline some helper functions
    - LP: #1294283
  * mm, meminit: allow early_pfn_to_nid to be used during runtime
    - LP: #1294283
  * mm: initialize hotplugged pages as reserved
    - LP: #1294283
  * gut proc_register() a bit
    - LP: #1519106
  * arm: factor out mmap ASLR into mmap_rnd
    - LP: #1518483
  * x86: standardize mmap_rnd() usage
    - LP: #1518483
  * arm64: standardize mmap_rnd() usage
    - LP: #1518483
  * mips: extract logic for mmap_rnd()
    - LP: #1518483
  * powerpc: standardize mmap_rnd() usage
    - LP: #1518483
  * s390: standardize mmap_rnd() usage
    - LP: #1518483
  * mm: expose arch_mmap_rnd when available
    - LP: #1518483
  * s390: redefine randomize_et_dyn for ELF_ET_DYN_BASE
    - LP: #1518483
  * mm: split ET_DYN ASLR from mmap ASLR
    - LP: #1518483
  * mm: fold arch_randomize_brk into ARCH_HAS_ELF_RANDOMIZE
    - LP: #1518483
  * isdn_ppp: Add checks for allocation failure in isdn_ppp_open()
   ...

Changed in linux (Ubuntu Vivid):
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers