Assignment of VDEV Sometimes Fails using Intel QAT

Bug #1867220 reported by Joseph Salisbury
Affects                      Importance   Assigned to
linux-azure (Ubuntu)         Undecided    Unassigned
  Bionic                     Undecided    Unassigned
  Eoan                       Undecided    Unassigned
linux-azure-4.15 (Ubuntu)    Undecided    Unassigned
  Bionic                     Undecided    Unassigned
  Eoan                       Undecided    Unassigned

Bug Description

Intel QAT (QuickAssist Technology) is a PCIe device that supports SR-IOV. It accelerates crypto and compression operations in hardware.

There is additional info for this hardware on the Intel website:
https://www.intel.com/content/www/us/en/products/docs/network-io/ethernet/10-25-40-gigabit-adapters/quickassist-adapter-for-servers.html
https://01.org/intel-quick-assist-technology

On Ubuntu, when SR-IOV is enabled on the QAT device, each device exposes 16 VFs.
Assigning a VDEV to a Linux VM fails intermittently: some attempts succeed and others fail.
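As a rough illustration of how the VF count mentioned above could be checked, the sketch below reads the standard `sriov_numvfs` and `sriov_totalvfs` sysfs attributes of a PCI device. It assumes a Linux system that owns the physical function; the function name `read_vf_counts` and the example device path are hypothetical and not taken from this report.

```python
from pathlib import Path
from typing import Tuple


def read_vf_counts(device_dir: Path) -> Tuple[int, int]:
    """Return (enabled VFs, maximum VFs) for one PCI device directory.

    device_dir is expected to look like /sys/bus/pci/devices/<address>
    and to belong to an SR-IOV capable physical function.
    """
    enabled = int((device_dir / "sriov_numvfs").read_text().strip())
    maximum = int((device_dir / "sriov_totalvfs").read_text().strip())
    return enabled, maximum


if __name__ == "__main__":
    # Hypothetical PCI address, used only as an example.
    example = Path("/sys/bus/pci/devices/0000:3d:00.0")
    if example.exists():
        print(read_vf_counts(example))
```

For the configuration described here, both values would be expected to be 16 once SR-IOV is enabled.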

The issue has been debugged. It is resolved by applying three patches and removing some buggy code that was added as a SAUCE patch. The three patches are:

f73f8a504e27 PCI: hv: Use bytes 4 and 5 from instance ID as the PCI domain numbers
be700103efd1 PCI: hv: Detect and fix Hyper-V PCI domain number collision
6ae91579061c PCI: hv: Add __aligned(8) to struct retarget_msi_interrupt

Also revert the patch titled Revert "PCI: hv: Make sure the bus domain is really unique"; that is, remove the line "if (list_empty(&hbus->children)) hbus->sysdata.domain = desc->ser" from new_pcichild_device() completely.

The patch "UBUNTU: SAUCE: pci-hyperv: Use only 16 bit integer for PCI domain" is also unnecessary now.
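The idea behind the first two patches can be sketched as follows. This is a simplified model based only on the patch titles, not the kernel code: the byte order (byte 5 as the high byte), the reservation of domain 0, the fallback to the lowest free number, and the names `requested_domain` and `allocate_domain` are all assumptions made for illustration.

```python
from typing import Optional, Set

DOMAIN_SPACE = 1 << 16  # domain numbers are treated as 16-bit values here
RESERVED = {0}          # assumption: domain 0 is left for emulated devices


def requested_domain(instance_id: bytes) -> int:
    """Build a 16-bit domain number from bytes 4 and 5 of a 16-byte instance ID."""
    if len(instance_id) != 16:
        raise ValueError("instance ID must be 16 bytes")
    # Assumption: byte 5 is the high byte and byte 4 is the low byte.
    return (instance_id[5] << 8) | instance_id[4]


def allocate_domain(requested: int, in_use: Set[int]) -> Optional[int]:
    """Return `requested` if it is free, otherwise the lowest free number.

    The chosen number is recorded in `in_use`, so a second device asking
    for the same number is detected as a collision and moved elsewhere.
    """
    if requested not in in_use and requested not in RESERVED:
        in_use.add(requested)
        return requested
    for candidate in range(DOMAIN_SPACE):
        if candidate not in in_use and candidate not in RESERVED:
            in_use.add(candidate)
            return candidate
    return None  # every domain number is taken
```

In this model, two devices whose instance IDs share bytes 4 and 5 still end up with distinct domain numbers, because the second request is redirected to a free slot instead of reusing the first one.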

summary: - Assignment of VDEV Sometimes Fails
+ Assignment of VDEV Somtimes Fails using Intel QAT
Dexuan Cui (decui) wrote :

The bug applies to both linux-azure-5.0.0-1032 and linux-azure 4.15.0-1074.

Dexuan Cui (decui) wrote :

BTW, the bug also applies to hwe-4.15.0-91.92_16.04.1 and Ubuntu-hwe-5.0.0-37.40_18.04.

Marcelo Cerri (mhcerri) wrote :

Dexuan, does that affect 5.3 and higher?

Dexuan Cui (decui) wrote :

Hi Marcelo, I'm not sure which v5.3 kernel you mean -- the v5.3 in Ubuntu 19.10, v5.4 in Ubuntu 20.04 or the upstream stable tree's v5.3 and newer? :-)

Here we need to make sure that the 3 patches in the Bug Description are included, and that the line "if (list_empty(&hbus->children)) hbus->sysdata.domain = desc->ser" in new_pcichild_device() is completely removed. BTW, this line never made it into the upstream kernel; it only appears in some (if not all) Ubuntu versions.
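One way to check a package changelog for the three patch subjects is sketched below. It assumes the changelog lists each change as a "- " item, possibly wrapped onto a following indented line; the helper names `changelog_items` and `missing_patches` are hypothetical. Entries that start with Revert are different strings, so they do not count as the patch being applied.

```python
from typing import List

REQUIRED = [
    "PCI: hv: Use bytes 4 and 5 from instance ID as the PCI domain numbers",
    "PCI: hv: Detect and fix Hyper-V PCI domain number collision",
    "PCI: hv: Add __aligned(8) to struct retarget_msi_interrupt",
]


def changelog_items(text: str) -> List[str]:
    """Collect '- ' items from changelog text, rejoining wrapped lines."""
    items: List[str] = []
    open_item = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("- "):
            items.append(stripped[2:])
            open_item = True
        elif open_item and stripped and not stripped.startswith(("* ", "[")):
            items[-1] += " " + stripped  # continuation of a wrapped item
        else:
            open_item = False
    return items


def missing_patches(text: str, required: List[str] = REQUIRED) -> List[str]:
    """Return the required subjects that do not appear as changelog items."""
    applied = set(changelog_items(text))
    return [subject for subject in required if subject not in applied]
```

A subject reported as missing only means it is not listed in the text that was checked. A patch may already be part of the kernel tree from an earlier upload or from the base kernel version, so the source tree remains the authoritative check.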

Marcelo Cerri (mhcerri)
Changed in linux-azure (Ubuntu Bionic):
status: New → Invalid
Changed in linux-azure-4.15 (Ubuntu Eoan):
status: New → Invalid
Changed in linux-azure (Ubuntu Eoan):
status: New → In Progress
Changed in linux-azure-4.15 (Ubuntu Bionic):
status: New → In Progress
Marcelo Cerri (mhcerri)
Changed in linux-azure (Ubuntu Eoan):
status: In Progress → Fix Committed
Changed in linux-azure-4.15 (Ubuntu Bionic):
status: In Progress → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-azure - 5.3.0-1022.23

---------------
linux-azure (5.3.0-1022.23) eoan; urgency=medium

  * eoan/linux-azure: 5.3.0-1022.23 -proposed tracker (LP: #1877945)

  [ Ubuntu: 5.3.0-53.47 ]

  * eoan/linux: 5.3.0-53.47 -proposed tracker (LP: #1877257)
  * Intermittent display blackouts on event (LP: #1875254)
    - drm/i915: Limit audio CDCLK>=2*BCLK constraint back to GLK only
  * Unable to handle kernel pointer dereference in virtual kernel address space
    on Eoan (LP: #1876645)
    - SAUCE: overlayfs: fix shitfs special-casing

linux-azure (5.3.0-1021.22) eoan; urgency=medium

  * eoan/linux-azure: 5.3.0-1021.22 -proposed tracker (LP: #1874742)

  * Packaging resync (LP: #1786013)
    - [Packaging] resync dkms-build and family
    - [Packaging] add libcap-dev dependency

  * Eoan update: upstream stable patchset 2020-04-08 (LP: #1871697)
    - [Config] azure: Update for NET_REDIRECT=y

  * getitimer returns it_value=0 erroneously (LP: #1349028)
    - [Config] azure: Annotate CONFIG_CONTEXT_TRACKING_FORCE=n

  * Assignment of VDEV Somtimes Fails using Intel QAT (LP: #1867220)
    - Revert "UBUNTU: SAUCE: pci-hyperv: Use only 16 bit integer for PCI domain"
    - Revert "Revert "PCI: hv: Make sure the bus domain is really unique""
    - PCI: hv: Detect and fix Hyper-V PCI domain number collision
    - PCI: hv: Use bytes 4 and 5 from instance ID as the PCI domain numbers

  [ Ubuntu: 5.3.0-52.46 ]

  * eoan/linux: 5.3.0-52.46 -proposed tracker (LP: #1874752)
  * alsa: make the dmic detection align to the mainline kernel-5.6
    (LP: #1871284)
    - ALSA: hda: add Intel DSP configuration / probe code
    - ALSA: hda: fix intel DSP config
    - ALSA: hda: Allow non-Intel device probe gracefully
    - ALSA: hda: More constifications
    - ALSA: hda: Rename back to dmic_detect option
    - [Config] SND_INTEL_DSP_CONFIG=m
    - [packaging] Remove snd-intel-nhlt from modules
  * built-using constraints preventing uploads (LP: #1875601)
    - temporarily drop Built-Using data
  * ubuntu/focal64 fails to mount Vagrant shared folders (LP: #1873506)
    - [Packaging] Move virtualbox modules to linux-modules
    - [Packaging] Remove vbox and zfs modules from generic.inclusion-list
  * linux-image-5.0.0-35-generic breaks checkpointing of container
    (LP: #1857257)
    - SAUCE: overlayfs: use shiftfs hacks only with shiftfs as underlay
  * shiftfs: broken shiftfs nesting (LP: #1872094)
    - SAUCE: shiftfs: record correct creator credentials
  * Add debian/rules targets to compile/run kernel selftests (LP: #1874286)
    - [Packaging] add support to compile/run selftests
  * shiftfs: O_TMPFILE reports ESTALE (LP: #1872757)
    - SAUCE: shiftfs: fix dentry revalidation
  * getitimer returns it_value=0 erroneously (LP: #1349028)
    - [Config] CONTEXT_TRACKING_FORCE policy should be unset
  * 5.3.0-46-generic - i915 - frequent GPU hangs / resets rcs0 (LP: #1872001)
    - drm/i915/execlists: Preempt-to-busy
    - drm/i915/gt: Detect if we miss WaIdleLiteRestore
    - drm/i915/execlists: Always force a context reload when rewinding RING_TAIL
  * alsa/sof: external mic can't be deteced on Lenovo and HP laptops
    (LP: #1...

Changed in linux-azure (Ubuntu Eoan):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-azure-4.15 - 4.15.0-1083.93

---------------
linux-azure-4.15 (4.15.0-1083.93) bionic; urgency=medium

  * bionic/linux-azure-4.15: 4.15.0-1083.93 -proposed tracker (LP: #1874773)

  * getitimer returns it_value=0 erroneously (LP: #1349028)
    - [Config] azure-4.15: Annotate CONFIG_CONTEXT_TRACKING_FORCE=n

  * Assignment of VDEV Somtimes Fails using Intel QAT (LP: #1867220)
    - Revert "Revert "PCI: hv: Make sure the bus domain is really unique""
    - Revert "PCI: hv: Use bytes 4 and 5 from instance ID as the PCI domain
      numbers"
    - PCI: hv: Add __aligned(8) to struct retarget_msi_interrupt
    - PCI: hv: Detect and fix Hyper-V PCI domain number collision
    - PCI: hv: Use bytes 4 and 5 from instance ID as the PCI domain numbers

  [ Ubuntu: 4.15.0-100.101 ]

  * bionic/linux: 4.15.0-100.101 -proposed tracker (LP: #1875878)
  * built-using constraints preventing uploads (LP: #1875601)
    - temporarily drop Built-Using data
  * Add debian/rules targets to compile/run kernel selftests (LP: #1874286)
    - [Packaging] add support to compile/run selftests
  * getitimer returns it_value=0 erroneously (LP: #1349028)
    - [Config] CONTEXT_TRACKING_FORCE policy should be unset
  * QEMU/KVM display is garbled when booting from kernel EFI stub due to missing
    bochs-drm module (LP: #1872863)
    - [Config] Enable CONFIG_DRM_BOCHS as module for all archs
  * Backport MPLS patches from 5.3 to 4.15 (LP: #1851446)
    - net/mlx5e: Report netdevice MPLS features
    - net: vlan: Inherit MPLS features from parent device
    - net: bonding: Inherit MPLS features from slave devices
    - net/mlx5e: Move to HW checksumming advertising
  * LIO hanging in iscsit_free_session and iscsit_stop_session (LP: #1871688)
    - scsi: target: remove boilerplate code
    - scsi: target: fix hang when multiple threads try to destroy the same iscsi
      session
    - scsi: target: iscsi: calling iscsit_stop_session() inside
      iscsit_close_session() has no effect
  * Add hw timestamps to received skbs in peak_canfd (LP: #1874124)
    - can: peak_canfd: provide hw timestamps in rx skbs
  * Bionic update: upstream stable patchset 2020-04-23 (LP: #1874502)
    - ARM: dts: sun8i-a83t-tbs-a711: HM5065 doesn't like such a high voltage
    - bus: sunxi-rsb: Return correct data when mixing 16-bit and 8-bit reads
    - net: vxge: fix wrong __VA_ARGS__ usage
    - hinic: fix a bug of waitting for IO stopped
    - hinic: fix wrong para of wait_for_completion_timeout
    - cxgb4/ptp: pass the sign of offset delta in FW CMD
    - qlcnic: Fix bad kzalloc null test
    - i2c: st: fix missing struct parameter description
    - firmware: arm_sdei: fix double-lock on hibernate with shared events
    - null_blk: Fix the null_add_dev() error path
    - null_blk: Handle null_add_dev() failures properly
    - null_blk: fix spurious IO errors after failed past-wp access
    - xhci: bail out early if driver can't accress host in resume
    - x86: Don't let pgprot_modify() change the page encryption bit
    - block: keep bdi->io_pages in sync with max_sectors_kb for stacked devices
    - irqchip/versatile-fpga: Handle chained IRQs properly
...

Changed in linux-azure-4.15 (Ubuntu Bionic):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-azure - 4.15.0-1083.93~16.04.1

---------------
linux-azure (4.15.0-1083.93~16.04.1) xenial; urgency=medium

  * xenial/linux-azure: 4.15.0-1083.93~16.04.1 -proposed tracker (LP: #1874771)

  [ Ubuntu: 4.15.0-1083.93 ]

  * bionic/linux-azure-4.15: 4.15.0-1083.93 -proposed tracker (LP: #1874773)
  * getitimer returns it_value=0 erroneously (LP: #1349028)
    - [Config] azure-4.15: Annotate CONFIG_CONTEXT_TRACKING_FORCE=n
  * Assignment of VDEV Somtimes Fails using Intel QAT (LP: #1867220)
    - Revert "Revert "PCI: hv: Make sure the bus domain is really unique""
    - Revert "PCI: hv: Use bytes 4 and 5 from instance ID as the PCI domain
      numbers"
    - PCI: hv: Add __aligned(8) to struct retarget_msi_interrupt
    - PCI: hv: Detect and fix Hyper-V PCI domain number collision
    - PCI: hv: Use bytes 4 and 5 from instance ID as the PCI domain numbers
  * bionic/linux: 4.15.0-100.101 -proposed tracker (LP: #1875878)
  * built-using constraints preventing uploads (LP: #1875601)
    - temporarily drop Built-Using data
  * Add debian/rules targets to compile/run kernel selftests (LP: #1874286)
    - [Packaging] add support to compile/run selftests
  * getitimer returns it_value=0 erroneously (LP: #1349028)
    - [Config] CONTEXT_TRACKING_FORCE policy should be unset
  * QEMU/KVM display is garbled when booting from kernel EFI stub due to missing
    bochs-drm module (LP: #1872863)
    - [Config] Enable CONFIG_DRM_BOCHS as module for all archs
  * Backport MPLS patches from 5.3 to 4.15 (LP: #1851446)
    - net/mlx5e: Report netdevice MPLS features
    - net: vlan: Inherit MPLS features from parent device
    - net: bonding: Inherit MPLS features from slave devices
    - net/mlx5e: Move to HW checksumming advertising
  * LIO hanging in iscsit_free_session and iscsit_stop_session (LP: #1871688)
    - scsi: target: remove boilerplate code
    - scsi: target: fix hang when multiple threads try to destroy the same iscsi
      session
    - scsi: target: iscsi: calling iscsit_stop_session() inside
      iscsit_close_session() has no effect
  * Add hw timestamps to received skbs in peak_canfd (LP: #1874124)
    - can: peak_canfd: provide hw timestamps in rx skbs
  * Bionic update: upstream stable patchset 2020-04-23 (LP: #1874502)
    - ARM: dts: sun8i-a83t-tbs-a711: HM5065 doesn't like such a high voltage
    - bus: sunxi-rsb: Return correct data when mixing 16-bit and 8-bit reads
    - net: vxge: fix wrong __VA_ARGS__ usage
    - hinic: fix a bug of waitting for IO stopped
    - hinic: fix wrong para of wait_for_completion_timeout
    - cxgb4/ptp: pass the sign of offset delta in FW CMD
    - qlcnic: Fix bad kzalloc null test
    - i2c: st: fix missing struct parameter description
    - firmware: arm_sdei: fix double-lock on hibernate with shared events
    - null_blk: Fix the null_add_dev() error path
    - null_blk: Handle null_add_dev() failures properly
    - null_blk: fix spurious IO errors after failed past-wp access
    - xhci: bail out early if driver can't accress host in resume
    - x86: Don't let pgprot_modify() change the page encryption bit
    - block: keep bdi->io_pages in sync with max_sector...

Changed in linux-azure (Ubuntu):
status: New → Fix Released