segfault upon reboot

Bug #1596579 reported by Eduardo on 2016-06-27
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
QEMU
Undecided
Unassigned

Bug Description

[ 31.167946] VFIO - User Level meta-driver version: 0.3
[ 34.969182] kvm: zapping shadow pages for mmio generation wraparound
[ 43.095077] vfio-pci 0000:1a:00.0: irq 50 for MSI/MSI-X
[166493.891331] perf interrupt took too long (2506 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
[315765.858431] qemu-kvm[1385]: segfault at 0 ip (null) sp 00007ffe5430db18 error 14
[315782.002077] vfio-pci 0000:1a:00.0: transaction is not cleared; proceeding with reset anyway
[315782.910854] mptsas 0000:1a:00.0: Refused to change power state, currently in D3
[315782.911236] mptbase: ioc1: Initiating bringup
[315782.911238] mptbase: ioc1: WARNING - Unexpected doorbell active!
[315842.957613] mptbase: ioc1: ERROR - Failed to come READY after reset! IocState=f0000000
[315842.957670] mptbase: ioc1: WARNING - ResetHistory bit failed to clear!
[315842.957675] mptbase: ioc1: ERROR - Diagnostic reset FAILED! (ffffffffh)
[315842.957717] mptbase: ioc1: WARNING - NOT READY WARNING!
[315842.957720] mptbase: ioc1: ERROR - didn't initialize properly! (-1)
[315842.957890] mptsas: probe of 0000:1a:00.0 failed with error -1

The qemu-kvm segfault happens when I issue a reboot on the Windows VM. The card I have is:
1a:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS (rev ff)

I have two of these cards (bought with many years difference), exact same model, and they fail the same way. I'm using PCI passthrough on this card for access to the tape drive.
This is very easy to reproduce, so feel free to let me know what to try.
Kernel 3.10.0-327.18.2.el7.x86_64 (Centos 7.2.1511).
qemu-kvm-1.5.3-105.el7_2.4.x86_64
Reporting it here because of the segfault, but I guess I might have to open a bug report with mptbase as well?

Running the debuginfo qemu-kvm rpm and attaching with gdb might be interesting to get a backtrace from that segfault. Otherwise devices getting stuck in D3 often mean they didn't return from a reset correctly. Are we to assume that the VM died between the vfio-pci line and the mptbase line and the device was set to managed='yes' and therefore libvirt returned the device to the host driver? Please document your VM config and provide lspci -vvv info for the assigned device.

Eduardo (eduardok) wrote :
Download full text (3.7 KiB)

The VM is fine until I issue a reboot on the guest OS, in this case it happened right at [315765.858431]. That is, once I boot the host, the guest starts fine, I am able to use the tape drive fine, but when I reboot for whatever reason, I guess the segfault.

Is this enough or do you want the full dumpxml ?

    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x1a' slot='0x00' function='0x0'/>
      </source>
      <alias name='hostdev0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
    </hostdev>

Makes sense to me what you've explained about libvirt returning to the host driver, but unfortunately I don't have enough knowledge to comment, sorry.

lspci -vvv:

1a:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS (rev 08)
        Subsystem: Hewlett-Packard Company SC44Ge Host Bus Adapter
        Physical Slot: 3
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 27
        Region 0: I/O ports at 2000 [disabled] [size=256]
        Region 1: Memory at 97a10000 (64-bit, non-prefetchable) [size=16K]
        Region 3: Memory at 97a00000 (64-bit, non-prefetchable) [size=64K]
        Expansion ROM at <ignored> [disabled]
        Capabilities: [50] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [68] Express (v1) Endpoint, MSI 00
                DevCap: MaxPayload 4096 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE- FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 256 bytes, MaxReadReq 4096 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 2.5GT/s, Width x8, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
                        ClockPM- Surprise- LLActRep- BwNot-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x8, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
        Capabilities: [98] MSI: Enable- Count=1/1 Maskable- 64bit+
                Address: 0000000000000000 Data: 0000
        Capabilities: [b0] MSI-X: Enable+ Count=1 Masked-
                Vector table: BAR=1 offset=00002000
                PBA: BAR=1 offset=00003000
        Capabilities: [100 v1] Advanced Error Reporting
                UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
                UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- U...

Read more...

Eduardo (eduardok) wrote :
Eduardo (eduardok) wrote :
Eduardo (eduardok) wrote :

By all means, feel free to provide me instructions on how to debug this myself, so I can help others in the future, although I understand that can be more time consuming. If anyone would rather prefer talking on IRC, just let me know the network and channel. Thanks

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers