QCA988X Wifi Card Not PCI Passing Through

Bug #1471583 reported by bill
This bug affects 1 person
Affects Status Importance Assigned to Milestone

Bug Description

CPU: Intel(R) Xeon(R) CPU E3-1265L v3 @ 2.50GHz
KVM: qemu-kvm-1.5.3-86.el7_1.2.x86_64
Kernel: 4.1.1-1.el7.elrepo.x86_64, and kernel-3.10.0-229.7.2.el7.x86_64
Host & Guest: CentOS 7.1
Using virt-manager-1.1.0-12.el7.noarch to create, configure, and start guest

I am trying to do a PCI passthrough of a QCA988X wifi card. It's a Doodle Labs military-grade 802.11ac miniPCI card, which uses the ath10k kernel driver. This card configures nicely on the host, and seems to pass through to the guest, but early in the boot of the guest it says "Unknown header type" at the wifi's bus address. And sure enough, lspci -vv on the host then shows:
        !!! Unknown header type 7f
        Kernel driver in use: vfio-pci

When the guest has booted, of course it shows as an Unclassified device. Host and guest must run at least kernel 4.0 so the wifi card's current firmware will load, and so that its driver comes with the kernel. I have both host and guest set up for the wifi card. I tried running kernel 3.10 in the host and passing through the PCI device, but same behavior.

I am passing through to the guest an Intel i350 ethernet card just fine, in fact I'm passing through two of its SR-IOV virt interfaces to the guest, so that works.

On the host, before I start the guest, the wifi card looks like this (lspci -vv):

0a:00.0 Network controller: Qualcomm Atheros QCA988x 802.11ac Wireless Network Adapter
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 43
        Region 0: Memory at f7000000 (64-bit, non-prefetchable) [size=2M]
        Expansion ROM at f7200000 [disabled] [size=64K]
        Capabilities: [40] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [50] MSI: Enable+ Count=8/8 Maskable+ 64bit-
                Address: fee00618 Data: 0000
                Masking: 00000000 Pending: 00000000
        Capabilities: [70] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
                LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <4us, L1 <64us
                        ClockPM- Surprise- LLActRep- BwNot-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Not Supported, TimeoutDis+, LTR-, OBFF Not Supported
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
                LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
                         EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
        Capabilities: [100 v1] Advanced Error Reporting
                UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
        Capabilities: [140 v1] Virtual Channel
                Caps: LPEVC=0 RefClk=100ns PATEntryBits=1
                Arb: Fixed- WRR32- WRR64- WRR128-
                Ctrl: ArbSelect=Fixed
                Status: InProgress-
                VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
                        Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=01
                        Status: NegoPending- InProgress-
        Capabilities: [160 v1] Device Serial Number 00-00-00-00-00-00-00-00
        Kernel driver in use: ath10k_pci

Revision history for this message
Alex Williamson (alex-l-williamson) wrote :
Revision history for this message
Alex Williamson (alex-l-williamson) wrote :

IOW, add a line like this below the line added by the above patch:


Double check that vendor:device ID against 'lspci -nn', that's 168c:003c.

Revision history for this message
bill (merc1984) wrote :

It does sound exactly like what I'm seeing.

From http://www.gossamer-threads.com/lists/linux/kernel/2054846

"Yes. If you *re*start the VM . ... The first start (after reboot) was not a problem."

It seems clear that this problem began with kernel 3.13. I tried applying the backports ath10k to kernel 3.10, but the kernel didn't recognize it, or install put it in the wrong place or something. So I tried kernel 4.1 and the module that comes with it fails this way. From the listserv I should have thought this would be fixed by kernel 4.1, but then maybe my device is so new and this has to be device-specific?

Thanks for your instructions Alex, but I don't fully understand. I searched for an options line I could put in /etc/modprobe.d/blacklist.conf to prevent PCI bus reset for the module, but couldn't find anything. The "Unknown header type" happens very early in the boot of the guest so I don't see how fixing the guest module would help.

Maybe you're saying that I must compile the backports module with some patch in the above link and add the line you suggest? Only thing is I don't understand why the module install failed. And I don't know whether the host should have the patched ath10k module as well as the guest. Does the host need the module at all of PCI passthrough?

Revision history for this message
bill (merc1984) wrote :

Actually my card locks up *on* boot of the guest, not after its reboot.

This generation of Atheros cards requires firmware files: https://wireless.wiki.kernel.org/en/users/Drivers/ath10k/firmware
So kernel 3.10 (default with CentOS 7.1) is not an option; 3.11 is the minimum, and that doesn't allow AP mode which I need. So hell, I might as well go with ElRepo's ml kernel, unless you recommend otherwise.

I compile the ath10k module (from somewhere), and somehow supplant the default ath10k module that comes with the kernel.

Revision history for this message
Alex Williamson (alex-l-williamson) wrote :

The host kernel ath10k drivers and firmware are irrelevant. The change I'm asking about in comment #2 requires a patch and recompile to the host kernel.

Revision history for this message
bill (merc1984) wrote :

You mean 'a patch and recompile to the -guest- kernel'? Otherwise I'm confused.

Revision history for this message
Alex Williamson (alex-l-williamson) wrote :

No, the -host- kernel. The problem is that these Atheros WiFi chips do not come back when we do a PCI bus reset. These devices only offer two ways to reset the device, a power management reset and a PCI bus reset. The extent of a power management reset is poorly defined, so we tend to prefer a PCI bus reset. A PCI bus reset is a standard part of the PCI specification and the device is expected to return from reset and be accessible. These devices never recover from reset, resulting in the behavior you're seeing. I've been unsuccessful in contacting Qualcomm/Atheros regarding this problem, so we're effectively blacklisting the devices in the host kernel to disallow the PCI bus reset mechanism.

Revision history for this message
bill (merc1984) wrote :

I see. And now I also understand that the patch is to the *PCI* driver, not the ath10k driver.

I was busily trying to get the SRPM for ElRepo's 4.1 kernel to recompile for the guest, but it is not there. He has something called "kernel-ml-4.1.1-1.el7.elrepo.nosrc.rpm" which is incomplete, and so is completely baffling. There are no instructions; instructions are for sissies...

I was about to give up.

But since I now see the patch has to be applied to the host kernel, I have a chance at getting the 3.10 SRPM. I guess I'd compile the kernel with rpmbuild and then just graft the PCI module into the regular binary kernel.

Revision history for this message
bill (merc1984) wrote :

Oh this is not good. The current kernel (3.10.0-229.7.2.el7.x86_64) that comes with the current CentOS (7.1) does not have in quirks.c the preceeding or succeeding stanzas in the patch here: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/pci/quirks.c?id=c3e59ee4e76686b0c84ca8faa1011d10cd4ca1b8

Apparently that patch was designed for kernel 3.14, so I'd better not move forward with 3.10. Even kernel-plus is still at 3.10. And ElRepo's 4.1 kernel-ml doesn't come with a complete kernel SRPM. So I'm stuck now. I want to stay with el7 packages if possible. I'm running KVM so need a kernel compatible with that.

Revision history for this message
bill (merc1984) wrote :

I've figured out how to compile ElRepo's kernel-ml and make the change to pci's quirks. But now the KVM's VM won't even boot. It gives a popup with:
"Error starting domain: Unable to read from monitor: Connection reset by peer

Traceback (most recent call last):
  File "/usr/share/virt-manager/virtManager/asyncjob.py", line 89, in cb_wrapper
    callback(asyncjob, *args, **kwargs)
  File "/usr/share/virt-manager/virtManager/asyncjob.py", line 125, in tmpcb
    callback(*args, **kwargs)
  File "/usr/share/virt-manager/virtManager/domain.py", line 1393, in startup
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 966, in create
    if ret == -1: raise libvirtError ('virDomainCreate() failed', dom=self)
libvirtError: Unable to read from monitor: Connection reset by peer"

It only happens when I add the Atheros QCS988x PCI device.

No idea.

Revision history for this message
bill (merc1984) wrote :

Oh I see. It's because the path that was shared on the host is no longer available, apparently causing this weird error message.

0a:00.0 Network controller: Qualcomm Atheros QCA988x 802.11ac Wireless Network Adapter (rev ff) (prog-if ff)
        !!! Unknown header type 7f

So even earlier than PCI reset on the guest, my device on the host is getting jammed.

I guess it has to go back. Nobody knows what's wrong. This is a Doodle Labs ACE-DB-3.

Revision history for this message
bill (merc1984) wrote :

Yep, lack of interest here. The ACE-DB-3 (and probably all QCA988x) simply does not work with Linux. No more time for this.

Revision history for this message
Thomas Huth (th-huth) wrote :

Looking through old bug tickets... can you still reproduce this issue with the latest version of QEMU and the kernel? Or could we close this ticket nowadays?

Changed in qemu:
status: New → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for QEMU because there has been no activity for 60 days.]

Changed in qemu:
status: Incomplete → Expired
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers