Comment 3 for bug 2029131

Roxana Nicolescu (roxanan) wrote :

Commit "440e01539fb10b3562c82f98634c688f4adc9ef1": "PCI: Unify delay handling for reset and resume" breaks the test.
It changes the reset path so that the kernel now waits for the device on the secondary bus to become ready. Before this change, it waited on the bridge itself, which always returned immediately because the bridge remains accessible while its secondary bus is in reset. This can also be seen in the logs.
Before the change, there is no waiting during reset and before resume:
EEH: Reset without hotplug activity
i40e 0002:01:00.0 enP2p1s0f0: NIC Link is Up, 1000 Mbps Full Duplex, Flow Control: None
aacraid 0003:01:00.0: enabling device (0140 -> 0142)
EEH: Beginning: 'slot_reset'
PCI 0003:01:00.0#01fd: EEH: Invoking aacraid->slot_reset()
aacraid 0003:01:00.0: aacraid: PCI error - slot reset
PCI 0003:01:00.0#01fd: EEH: aacraid driver reports: 'recovered'
EEH: Finished:'slot_reset' with aggregate recovery state:'recovered'
EEH: Notify device driver to resume
EEH: Beginning: 'resume'

After this change:
EEH: Reset without hotplug activity
EXT4-fs warning (device sda2): ext4_end_bio:311: I/O error 10 writing to inode 68944133 (offset 0 size 0 starting block 14398)
i40e 0002:01:00.0 enP2p1s0f0: NIC Link is Up, 1000 Mbps Full Duplex, Flow Control: None
aacraid 0003:01:00.0: not ready 1023ms after bus reset; waiting
aacraid 0003:01:00.0: not ready 2047ms after bus reset; waiting
aacraid 0003:01:00.0: not ready 4095ms after bus reset; waiting
aacraid 0003:01:00.0: not ready 8191ms after bus reset; waiting
aacraid 0003:01:00.0: not ready 16383ms after bus reset; waiting
aacraid 0003:01:00.0: not ready 32767ms after bus reset; waiting
ast 0004:02:00.0: Going to break: [mem 0x600c200000000-0x600c200ffffff]
ast 0004:02:00.0: eeh_dev_check_failure(0004:02:00.0) = 0
xhci_hcd 0005:01:00.0: Going to break: [mem 0x600c280000000-0x600c28000ffff 64bit]
xhci_hcd 0005:01:00.0: eeh_dev_check_failure(0005:01:00.0) = 1
aacraid 0003:01:00.0: not ready 65535ms after bus reset; giving up
aacraid 0003:01:00.0: enabling device (0140 -> 0142)
EEH: Beginning: 'slot_reset'
PCI 0003:01:00.0#01fd: EEH: Invoking aacraid->slot_reset()
aacraid 0003:01:00.0: aacraid: PCI error - slot reset
PCI 0003:01:00.0#01fd: EEH: aacraid driver reports: 'recovered'
EEH: Finished:'slot_reset' with aggregate recovery state:'recovered'
EEH: Notify device driver to resume
EEH: Beginning: 'resume'

So what I found is that resetting and then resuming the device simply takes longer now: in the log above, the kernel waits about 65 seconds (65535ms) before giving up. Increasing the timeout in the test to 120s makes it pass.
BUT I struggle to understand why it worked before. I would assume that if the waiting was not done properly, the device would not show up as recovered afterwards. I am still trying to understand that.
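
To make the timing concrete, here is a minimal sketch that replays the wait pattern visible in the log. It assumes a poll loop that sleeps 1ms, doubles the delay after each attempt, starts logging once the delay exceeds 1000ms, and gives up once the delay exceeds a 60000ms limit. Those constants and the function name `simulate_wait` are assumptions inferred from the log messages, not taken from the commit itself, and the device is modelled as never becoming ready.

```python
def simulate_wait(timeout_ms=60000, log_threshold_ms=1000):
    """Replay a doubling-delay poll loop for a device that never becomes ready.

    Returns (total_ms_slept, messages). All constants are assumptions
    inferred from the log in this comment, not read from kernel source.
    """
    messages = []
    delay = 1   # length of the next sleep, in ms
    slept = 0   # total time spent sleeping so far, in ms

    while True:
        if delay > timeout_ms:
            # The next sleep would exceed the limit: stop waiting.
            messages.append(f"not ready {slept}ms after bus reset; giving up")
            return slept, messages
        if delay > log_threshold_ms:
            messages.append(f"not ready {slept}ms after bus reset; waiting")
        slept += delay  # stands in for an actual sleep of `delay` ms
        delay *= 2


if __name__ == "__main__":
    total, msgs = simulate_wait()
    for line in msgs:
        print(line)
    print(f"total wait: {total / 1000:.1f}s")
```

Under these assumptions the simulated wait totals 65535ms, which matches the last "giving up" line in the log and would explain why a 120s test timeout is enough.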

Note: I am not sure where this read-only problem comes from. I think I managed to reproduce it once, but I could not reproduce it again.