[Possible Regression] powerpc/eeh:eeh-basic.sh failed on P9 with 5.4.0-156-generic

Bug #2029131 reported by Po-Hsu Lin
14
This bug affects 2 people
Affects Status Importance Assigned to Milestone
ubuntu-kernel-tests
New
Undecided
Unassigned

Bug Description

Issue found on P9 node baltar and P8 node dryden with 5.4.0-156-generic.

This issue does not exist with 5.4.0-155-generic

Manually tested on P9 node baltar, with:
* 5.4.0-155-generic with 5.4.0-155-generic source - PASSED (2 out of 2 attempts)
* 5.4.0-155-generic with 5.4.0-156-generic source - PASSED (1 out of 1 attempt)
* 5.4.0-156-generic with 5.4.0-155-generic source - FAILED (2 out of 2 attempts)
* 5.4.0-156-generic with 5.4.0-156-generic source - FAILED (2 out of 2 attempts)

Test output:
0002:01:00.0, Failed to recover!
Breaking
....
0003:01:00.0, Failed to recover!
Breaking
0004:02:00.0, Recovered after 0 seconds
....
0005:01:00.0, Failed to recover!
3 devices failed to recover (4 tested)

dmesg output (trimmed to fit launchpad restriction, please find attachment for the original output):
[ 239.575350] i40e 0002:01:00.0: Going to break: [mem 0x6008008000000-0x60080087fffff 64bit pref]
[ 239.587625] i40e 0002:01:00.0: eeh_dev_check_failure(0002:01:00.0) = 1
[ 239.587648] EEH: Recovering PHB#2-PE#0
[ 239.587668] EEH: PE location: Onboard LAN, PHB location: N/A
[ 239.587682] EEH: Frozen PHB#2-PE#0 detected
[ 239.587692] EEH: Call Trace:
[ 239.587710] EEH: [c00000000004cb6c] __eeh_send_failure_event+0x6c/0x120
[ 239.587726] EEH: [c000000000045824] eeh_dev_check_failure.part.0+0x294/0x5f0
[ 239.587745] EEH: [c000000000045d30] eeh_dev_check_write+0x140/0x1e0
[ 239.587761] EEH: [c00000000048cc68] __vfs_write+0x38/0x70
[ 239.587775] EEH: [c00000000048e00c] vfs_write+0x11c/0x2a0
[ 239.587788] EEH: [c000000000490c94] ksys_write+0x84/0x140
[ 239.587802] EEH: [c00000000000b378] system_call+0x5c/0x68
[ 239.587816] EEH: This PCI device has failed 1 times in the last hour and will be permanently disabled after 5000 failures.
[ 239.587817] EEH: Notify device drivers to shutdown
[ 239.587820] EEH: Beginning: 'error_detected(IO frozen)'
[ 239.587824] PCI 0002:01:00.0#0000: EEH: Invoking i40e->error_detected(IO frozen)
[ 239.587827] i40e 0002:01:00.0: i40e_pci_error_detected: error 2
[ 239.588948] i40e 0002:01:00.0: VSI seid 396 Rx ring 0 dissable timeout
[ 239.589205] i40e 0002:01:00.0: VSI seid 396 Rx ring 1 dissable timeout
...
[ 239.618060] i40e 0002:01:00.0: VSI seid 396 Rx ring 117 dissable timeout
[ 239.618305] i40e 0002:01:00.0: VSI seid 396 Rx ring 118 dissable timeout
[ 239.678139] i40e 0002:01:00.0: VSI seid 396 Tx ring 0 disable timeout
[ 239.688707] i40e 0002:01:00.0: VSI seid 400 Rx ring 383 dissable timeout
[ 239.746150] i40e 0002:01:00.0: VSI seid 400 Tx ring 383 disable timeout
[ 239.746322] PCI 0002:01:00.0#0000: EEH: i40e driver reports: 'need reset'
[ 239.746325] PCI 0002:01:00.1#0000: EEH: Invoking i40e->error_detected(IO frozen)
[ 239.746326] i40e 0002:01:00.1: i40e_pci_error_detected: error 2
[ 239.752006] i40e 0002:01:00.1: VSI seid 397 Rx ring 0 dissable timeout
[ 239.752243] i40e 0002:01:00.1: VSI seid 397 Rx ring 1 dissable timeout
[ 239.752479] i40e 0002:01:00.1: VSI seid 397 Rx ring 2 dissable timeout
...
[ 239.779652] i40e 0002:01:00.1: VSI seid 397 Rx ring 116 dissable timeout
[ 239.779889] i40e 0002:01:00.1: VSI seid 397 Rx ring 117 dissable timeout
[ 239.780126] i40e 0002:01:00.1: VSI seid 397 Rx ring 118 dissable timeout
[ 239.838150] i40e 0002:01:00.1: VSI seid 397 Tx ring 0 disable timeout
[ 239.848439] i40e 0002:01:00.1: VSI seid 401 Rx ring 383 dissable timeout
[ 239.850134] i40e 0002:01:00.3: ARQ: Unknown event 0x0000 ignored
[ 239.850139] i40e 0002:01:00.3: ARQ: Unknown event 0x0000 ignored
[ 239.850144] i40e 0002:01:00.3: ARQ: Unknown event 0x0000 ignored
...
[ 239.850762] i40e 0002:01:00.2: ARQ: Unknown event 0x0000 ignored
[ 239.850766] i40e 0002:01:00.2: ARQ: Unknown event 0x0000 ignored
[ 239.850770] i40e 0002:01:00.2: ARQ: Unknown event 0x0000 ignored
[ 239.906076] i40e 0002:01:00.1: VSI seid 401 Tx ring 383 disable timeout
[ 239.906192] PCI 0002:01:00.1#0000: EEH: i40e driver reports: 'need reset'
[ 239.906194] PCI 0002:01:00.2#0000: EEH: Invoking i40e->error_detected(IO frozen)
[ 239.906196] i40e 0002:01:00.2: i40e_pci_error_detected: error 2
[ 239.907037] i40e 0002:01:00.2: VSI seid 398 Rx ring 0 dissable timeout
[ 239.907271] i40e 0002:01:00.2: VSI seid 398 Rx ring 1 dissable timeout
[ 239.907506] i40e 0002:01:00.2: VSI seid 398 Rx ring 2 dissable timeout
...
[ 239.933716] i40e 0002:01:00.2: VSI seid 398 Rx ring 114 dissable timeout
[ 239.933950] i40e 0002:01:00.2: VSI seid 398 Rx ring 115 dissable timeout
[ 239.934185] i40e 0002:01:00.2: VSI seid 398 Rx ring 116 dissable timeout
[ 239.934419] i40e 0002:01:00.2: VSI seid 398 Rx ring 117 dissable timeout
[ 239.934653] i40e 0002:01:00.2: VSI seid 398 Rx ring 118 dissable timeout
[ 239.994162] i40e 0002:01:00.2: VSI seid 398 Tx ring 0 disable timeout
[ 240.004456] i40e 0002:01:00.2: VSI seid 402 Rx ring 383 dissable timeout
[ 240.062532] i40e 0002:01:00.2: VSI seid 402 Tx ring 383 disable timeout
[ 240.062644] PCI 0002:01:00.2#0000: EEH: i40e driver reports: 'need reset'
[ 240.062647] PCI 0002:01:00.3#0000: EEH: Invoking i40e->error_detected(IO frozen)
[ 240.062649] i40e 0002:01:00.3: i40e_pci_error_detected: error 2
[ 240.063943] i40e 0002:01:00.3: VSI seid 399 Rx ring 0 dissable timeout
[ 240.064178] i40e 0002:01:00.3: VSI seid 399 Rx ring 1 dissable timeout
[ 240.064412] i40e 0002:01:00.3: VSI seid 399 Rx ring 2 dissable timeout
...
[ 240.091741] i40e 0002:01:00.3: VSI seid 399 Rx ring 117 dissable timeout
[ 240.091979] i40e 0002:01:00.3: VSI seid 399 Rx ring 118 dissable timeout
[ 240.150216] i40e 0002:01:00.3: VSI seid 399 Tx ring 0 disable timeout
[ 240.160564] i40e 0002:01:00.3: VSI seid 403 Rx ring 383 dissable timeout
[ 240.218163] i40e 0002:01:00.3: VSI seid 403 Tx ring 383 disable timeout
[ 240.218272] PCI 0002:01:00.3#0000: EEH: i40e driver reports: 'need reset'
[ 240.218274] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'need reset'
[ 240.218277] EEH: Collect temporary log
[ 240.218342] EEH: of node=0002:01:00.0
[ 240.218344] EEH: PCI device/vendor: 15898086
[ 240.218346] EEH: PCI cmd/status register: 00100546
[ 240.218346] EEH: PCI-E capabilities and status follow:
[ 240.218355] EEH: PCI-E 00: 0002e010 10008ce4 00002150 00425083
[ 240.218361] EEH: PCI-E 10: 10830000 00000000 00000000 00000000
[ 240.218361] EEH: PCI-E 20: 00000000
[ 240.218362] EEH: PCI-E AER capability register set follows:
[ 240.218369] EEH: PCI-E AER 00: 14020001 00000000 00000000 00062030
[ 240.218375] EEH: PCI-E AER 10: 00000000 00002000 000001e0 00000000
[ 240.218380] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
[ 240.218382] EEH: PCI-E AER 30: 00000000 00000000
[ 240.218384] EEH: of node=0002:01:00.1
[ 240.218385] EEH: PCI device/vendor: 15898086
[ 240.218387] EEH: PCI cmd/status register: 00100546
[ 240.218387] EEH: PCI-E capabilities and status follow:
[ 240.218395] EEH: PCI-E 00: 0002e010 10008ce4 00002150 00425083
[ 240.218401] EEH: PCI-E 10: 10830000 00000000 00000000 00000000
[ 240.218401] EEH: PCI-E 20: 00000000
[ 240.218402] EEH: PCI-E AER capability register set follows:
[ 240.218409] EEH: PCI-E AER 00: 14020001 00000000 00000000 00062030
[ 240.218415] EEH: PCI-E AER 10: 00000000 00002000 000001e0 00000000
[ 240.218420] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
[ 240.218422] EEH: PCI-E AER 30: 00000000 00000000
[ 240.218424] EEH: of node=0002:01:00.2
[ 240.218425] EEH: PCI device/vendor: 15898086
[ 240.218427] EEH: PCI cmd/status register: 00100546
[ 240.218427] EEH: PCI-E capabilities and status follow:
[ 240.218435] EEH: PCI-E 00: 0002e010 10008ce4 00002150 00425083
[ 240.218441] EEH: PCI-E 10: 10830000 00000000 00000000 00000000
[ 240.218441] EEH: PCI-E 20: 00000000
[ 240.218442] EEH: PCI-E AER capability register set follows:
[ 240.218449] EEH: PCI-E AER 00: 14020001 00000000 00000000 00062030
[ 240.218454] EEH: PCI-E AER 10: 00000000 00002000 000001e0 00000000
[ 240.218460] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
[ 240.218462] EEH: PCI-E AER 30: 00000000 00000000
[ 240.218463] EEH: of node=0002:01:00.3
[ 240.218465] EEH: PCI device/vendor: 15898086
[ 240.218467] EEH: PCI cmd/status register: 00100546
[ 240.218467] EEH: PCI-E capabilities and status follow:
[ 240.218475] EEH: PCI-E 00: 0002e010 10008ce4 00002150 00425083
[ 240.218480] EEH: PCI-E 10: 10830000 00000000 00000000 00000000
[ 240.218481] EEH: PCI-E 20: 00000000
[ 240.218482] EEH: PCI-E AER capability register set follows:
[ 240.218489] EEH: PCI-E AER 00: 14020001 00000000 00000000 00062030
[ 240.218495] EEH: PCI-E AER 10: 00000000 00002000 000001e0 00000000
[ 240.218500] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
[ 240.218502] EEH: PCI-E AER 30: 00000000 00000000
[ 240.218503] PHB4 PHB#2 Diag-data (Version: 1)
[ 240.218504] brdgCtl: 00000002
[ 240.218505] RootSts: 00000040 00402000 a0830008 00100107 00002000
[ 240.218506] PhbSts: 0000001c00000000 0000001c00000000
[ 240.218507] Lem: 0000000100280000 0000000000000000 0000000100000000
[ 240.218508] PhbErr: 0000088000000000 0000008000000000 2148000098000240 a008400000000000
[ 240.218509] RxeArbErr: 4000200000000000 0000200000000000 00009fde30000000 0000000000000000
[ 240.218510] PblErr: 0000000001000000 0000000001000000 0000000000000000 0000000000000000
[ 240.218510] RegbErr: 0000004000000000 0000004000000000 8800003c00000000 0000000007011000
[ 240.218511] PE[000] A/B: a440002a01000000 8000000000000000
[ 240.218514] EEH: Reset without hotplug activity
[ 246.410542] i40e 0002:01:00.0: not ready 1023ms after bus reset; waiting
[ 247.466583] i40e 0002:01:00.0: not ready 2047ms after bus reset; waiting
[ 249.578691] i40e 0002:01:00.0: not ready 4095ms after bus reset; waiting
[ 253.930949] i40e 0002:01:00.0: not ready 8191ms after bus reset; waiting
[ 262.379518] i40e 0002:01:00.0: not ready 16383ms after bus reset; waiting
[ 279.532384] i40e 0002:01:00.0: not ready 32767ms after bus reset; waiting
[ 301.231575] aacraid 0003:01:00.0: Going to break: [mem 0x600c180000000-0x600c1800fffff 64bit]
[ 301.245067] aacraid 0003:01:00.0: eeh_dev_check_failure(0003:01:00.0) = 1
[ 314.349746] i40e 0002:01:00.0: not ready 65535ms after bus reset; giving up
[ 317.454022] i40e 0002:01:00.0: enabling device (0140 -> 0142)
[ 317.454029] i40e 0002:01:00.1: enabling device (0140 -> 0142)
[ 317.454035] i40e 0002:01:00.2: enabling device (0140 -> 0142)
[ 317.454040] i40e 0002:01:00.3: enabling device (0140 -> 0142)
[ 317.466091] EEH: Beginning: 'slot_reset'
[ 317.466100] PCI 0002:01:00.0#0000: EEH: Invoking i40e->slot_reset()
[ 317.466456] PCI 0002:01:00.0#0000: EEH: i40e driver reports: 'recovered'
[ 317.466459] PCI 0002:01:00.1#0000: EEH: Invoking i40e->slot_reset()
[ 317.466795] PCI 0002:01:00.1#0000: EEH: i40e driver reports: 'recovered'
[ 317.466797] PCI 0002:01:00.2#0000: EEH: Invoking i40e->slot_reset()
[ 317.467143] PCI 0002:01:00.2#0000: EEH: i40e driver reports: 'recovered'
[ 317.467145] PCI 0002:01:00.3#0000: EEH: Invoking i40e->slot_reset()
[ 317.467492] PCI 0002:01:00.3#0000: EEH: i40e driver reports: 'recovered'
[ 317.467494] EEH: Finished:'slot_reset' with aggregate recovery state:'recovered'
[ 317.467495] EEH: Notify device driver to resume
[ 317.467496] EEH: Beginning: 'resume'
[ 317.467498] PCI 0002:01:00.0#0000: EEH: Invoking i40e->resume()
[ 317.483355] i40e 0002:01:00.0 enP2p1s0f0: NIC Link is Down
[ 317.727000] PCI 0002:01:00.0#0000: EEH: i40e driver reports: 'none'
[ 317.727005] PCI 0002:01:00.1#0000: EEH: Invoking i40e->resume()
[ 317.986624] PCI 0002:01:00.1#0000: EEH: i40e driver reports: 'none'
[ 317.986629] PCI 0002:01:00.2#0000: EEH: Invoking i40e->resume()
[ 318.245813] PCI 0002:01:00.2#0000: EEH: i40e driver reports: 'none'
[ 318.245817] PCI 0002:01:00.3#0000: EEH: Invoking i40e->resume()
[ 318.505912] PCI 0002:01:00.3#0000: EEH: i40e driver reports: 'none'
[ 318.505915] EEH: Finished:'resume'
[ 318.505917] EEH: Recovery successful.
[ 318.505921] EEH: Recovering PHB#3-PE#1fd
[ 318.505940] EEH: PE location: Onboard SAS, PHB location: N/A
[ 318.505958] EEH: Frozen PHB#3-PE#1fd detected
[ 318.505965] EEH: Call Trace:
[ 318.505977] EEH: [c00000000004cb6c] __eeh_send_failure_event+0x6c/0x120
[ 318.505996] EEH: [c000000000045824] eeh_dev_check_failure.part.0+0x294/0x5f0
[ 318.506016] EEH: [c000000000045d30] eeh_dev_check_write+0x140/0x1e0
[ 318.506044] EEH: [c00000000048cc68] __vfs_write+0x38/0x70
[ 318.506061] EEH: [c00000000048e00c] vfs_write+0x11c/0x2a0
[ 318.506079] EEH: [c000000000490c94] ksys_write+0x84/0x140
[ 318.506097] EEH: [c00000000000b378] system_call+0x5c/0x68
[ 318.506114] EEH: This PCI device has failed 1 times in the last hour and will be permanently disabled after 5000 failures.
[ 318.506114] EEH: Notify device drivers to shutdown
[ 318.506116] EEH: Beginning: 'error_detected(IO frozen)'
[ 318.506118] PCI 0003:01:00.0#01fd: EEH: Invoking aacraid->error_detected(IO frozen)
[ 318.506121] aacraid 0003:01:00.0: aacraid: PCI error detected 2
[ 318.506173] sd 0:2:0:0: [sda] tag#728 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[ 318.506175] sd 0:2:0:0: [sda] tag#731 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[ 318.506180] sd 0:2:0:0: [sda] tag#912 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[ 318.506182] sd 0:2:0:0: [sda] tag#731 CDB: Write(10) 2a 00 00 01 bd 70 00 01 00 00
[ 318.506185] blk_update_request: I/O error, dev sda, sector 114032 op 0x1:(WRITE) flags 0x800 phys_seg 2 prio class 0
[ 318.506188] sd 0:2:0:0: [sda] tag#728 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
[ 318.506190] sd 0:2:0:0: [sda] tag#912 CDB: Write(10) 2a 00 69 80 99 00 00 02 00 00
[ 318.506195] blk_update_request: I/O error, dev sda, sector 1770035456 op 0x1:(WRITE) flags 0x4800 phys_seg 4 prio class 0
[ 318.506211] sd 0:2:0:0: [sda] tag#913 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[ 318.506215] sd 0:2:0:0: [sda] tag#913 CDB: Write(10) 2a 00 69 80 9b 00 00 02 00 00
[ 318.506219] blk_update_request: I/O error, dev sda, sector 1770035968 op 0x1:(WRITE) flags 0x4800 phys_seg 4 prio class 0
[ 318.506226] sd 0:2:0:0: [sda] tag#914 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[ 318.506232] EXT4-fs warning (device sda2): ext4_end_bio:311: I/O error 10 writing to inode 68944133 (offset 0 size 0 starting block 14286)
[ 318.506233] sd 0:2:0:0: [sda] tag#914 CDB: Write(10) 2a 00 69 80 9d 00 00 02 00 00
[ 318.506237] blk_update_request: I/O error, dev sda, sector 1770036480 op 0x1:(WRITE) flags 0x4800 phys_seg 4 prio class 0
[ 318.506244] sd 0:2:0:0: [sda] tag#915 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[ 318.506247] sd 0:2:0:0: [sda] tag#915 CDB: Write(10) 2a 00 69 80 9f 00 00 02 00 00
[ 318.506251] blk_update_request: I/O error, dev sda, sector 1770036992 op 0x1:(WRITE) flags 0x4800 phys_seg 4 prio class 0
[ 318.506257] sd 0:2:0:0: [sda] tag#916 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[ 318.506261] sd 0:2:0:0: [sda] tag#916 CDB: Write(10) 2a 00 69 80 a1 00 00 02 00 00
[ 318.506265] blk_update_request: I/O error, dev sda, sector 1770037504 op 0x1:(WRITE) flags 0x4800 phys_seg 4 prio class 0
[ 318.506278] blk_update_request: I/O error, dev sda, sector 0 op 0x1:(WRITE) flags 0x800 phys_seg 0 prio class 0
[ 318.506280] sd 0:2:0:0: [sda] tag#917 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[ 318.506284] sd 0:2:0:0: [sda] tag#917 CDB: Write(10) 2a 00 69 80 a3 00 00 02 00 00
[ 318.506287] blk_update_request: I/O error, dev sda, sector 1770038016 op 0x1:(WRITE) flags 0x4800 phys_seg 4 prio class 0
[ 318.506294] sd 0:2:0:0: [sda] tag#918 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[ 318.506298] sd 0:2:0:0: [sda] tag#918 CDB: Write(10) 2a 00 69 80 a5 00 00 02 00 00
[ 318.506301] blk_update_request: I/O error, dev sda, sector 1770038528 op 0x1:(WRITE) flags 0x4800 phys_seg 4 prio class 0
[ 318.506307] sd 0:2:0:0: [sda] tag#919 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[ 318.506311] sd 0:2:0:0: [sda] tag#919 CDB: Write(10) 2a 00 69 80 a7 00 00 02 00 00
[ 318.506314] blk_update_request: I/O error, dev sda, sector 1770039040 op 0x1:(WRITE) flags 0x800 phys_seg 4 prio class 0
[ 318.506320] Buffer I/O error on device sda2, logical block 11950
[ 318.506639] Buffer I/O error on device sda2, logical block 11951
[ 318.506664] Buffer I/O error on device sda2, logical block 11952
[ 318.506701] Buffer I/O error on device sda2, logical block 11953
[ 318.506738] Buffer I/O error on device sda2, logical block 11954
[ 318.506773] Buffer I/O error on device sda2, logical block 11955
[ 318.506818] Buffer I/O error on device sda2, logical block 11956
[ 318.506854] Buffer I/O error on device sda2, logical block 11957
[ 318.506882] PCI 0003:01:00.0#01fd: EEH: aacraid driver reports: 'need reset'
[ 318.506885] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'need reset'
[ 318.506890] Buffer I/O error on device sda2, logical block 11958
[ 318.506896] Buffer I/O error on device sda2, logical block 11959
[ 318.506928] EEH: Collect temporary log
[ 318.506966] EEH: of node=0003:01:00.0
[ 318.506968] EXT4-fs warning (device sda2): ext4_end_bio:311: I/O error 10 writing to inode 68944133 (offset 0 size 0 starting block 14334)
[ 318.506971] EEH: PCI device/vendor: 028d9005
[ 318.506974] EEH: PCI cmd/status register: 00100146
[ 318.506975] EEH: PCI-E capabilities and status follow:
[ 318.506980] EXT4-fs warning (device sda2): ext4_end_bio:311: I/O error 10 writing to inode 68944133 (offset 0 size 0 starting block 14558)
[ 318.506984] EEH: PCI-E 00: 00020010 000081a2 00002950 00437083
[ 318.506989] EXT4-fs warning (device sda2): ext4_end_bio:311: I/O error 10 writing to inode 68944133 (offset 0 size 0 starting block 14766)
[ 318.506993] EEH: PCI-E 10: 10830000 00000000 00000000 00000000
[ 318.506994] EEH: PCI-E 20: 00000000
[ 318.506995] EEH: PCI-E AER capability register set follows:
[ 318.507000] EXT4-fs warning (device sda2): ext4_end_bio:311: I/O error 10 writing to inode 68944133 (offset 0 size 0 starting block 275839745)
[ 318.507004] EEH: PCI-E AER 00: 30020001 00000000 00400000 00462030
[ 318.507009] EXT4-fs warning (device sda2): ext4_end_bio:311: I/O error 10 writing to inode 68944133 (offset 0 size 0 starting block 13854)
[ 318.507012] EEH: PCI-E AER 10: 00000000 0000e000 000001e0 00000000
[ 318.507018] EXT4-fs warning (device sda2): ext4_end_bio:311: I/O error 10 writing to inode 68944133 (offset 0 size 0 starting block 14862)
[ 318.507019] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
[ 318.507022] EEH: PCI-E AER 30: 00000000 00000000
[ 318.507025] PHB4 PHB#3 Diag-data (Version: 1)
[ 318.507026] brdgCtl: 00000002
[ 318.507027] RootSts: 00000040 00402000 a0830008 00100107 00002000
[ 318.507028] PhbSts: 0000001c00000000 0000001c00000000
[ 318.507029] Lem: 0000000100280000 0000000000000000 0000000100000000
[ 318.507031] PhbErr: 0000088000000000 0000008000000000 2148000098000240 a008400000000000
[ 318.507032] RxeArbErr: 4000200000000000 0000200000000000 00009fde30000000 0000000000000000
[ 318.507033] PblErr: 0000000001000000 0000000001000000 0000000000000000 0000000000000000
[ 318.507035] RegbErr: 0000004000000000 0000004000000000 8800003c00000000 0000000007011000
[ 318.507038] EXT4-fs warning (device sda2): ext4_end_bio:311: I/O error 10 writing to inode 68944133 (offset 0 size 0 starting block 14894)
[ 318.507045] PE[1fd] A/B: a440002a01000000 8000000000000000
[ 318.507047] EXT4-fs warning (device sda2): ext4_end_bio:311: I/O error 10 writing to inode 68944133 (offset 0 size 0 starting block 14942)
[ 318.507052] EEH: Reset without hotplug activity
[ 318.507056] EXT4-fs warning (device sda2): ext4_end_bio:311: I/O error 10 writing to inode 68944133 (offset 0 size 0 starting block 14398)
[ 321.290107] i40e 0002:01:00.0 enP2p1s0f0: NIC Link is Up, 1000 Mbps Full Duplex, Flow Control: None
[ 324.717574] aacraid 0003:01:00.0: not ready 1023ms after bus reset; waiting
[ 325.773607] aacraid 0003:01:00.0: not ready 2047ms after bus reset; waiting
[ 327.917677] aacraid 0003:01:00.0: not ready 4095ms after bus reset; waiting
[ 332.269835] aacraid 0003:01:00.0: not ready 8191ms after bus reset; waiting
[ 340.718133] aacraid 0003:01:00.0: not ready 16383ms after bus reset; waiting
[ 357.358654] aacraid 0003:01:00.0: not ready 32767ms after bus reset; waiting
[ 362.872870] ast 0004:02:00.0: Going to break: [mem 0x600c200000000-0x600c200ffffff]
[ 362.872939] ast 0004:02:00.0: eeh_dev_check_failure(0004:02:00.0) = 0
[ 362.886464] xhci_hcd 0005:01:00.0: Going to break: [mem 0x600c280000000-0x600c28000ffff 64bit]
[ 362.899022] xhci_hcd 0005:01:00.0: eeh_dev_check_failure(0005:01:00.0) = 1
[ 392.175709] aacraid 0003:01:00.0: not ready 65535ms after bus reset; giving up
[ 395.279850] aacraid 0003:01:00.0: enabling device (0140 -> 0142)
[ 395.296599] EEH: Beginning: 'slot_reset'
[ 395.296604] PCI 0003:01:00.0#01fd: EEH: Invoking aacraid->slot_reset()
[ 395.296606] aacraid 0003:01:00.0: aacraid: PCI error - slot reset
[ 395.296669] PCI 0003:01:00.0#01fd: EEH: aacraid driver reports: 'recovered'
[ 395.296670] EEH: Finished:'slot_reset' with aggregate recovery state:'recovered'
[ 395.296671] EEH: Notify device driver to resume
[ 395.296672] EEH: Beginning: 'resume'
[ 395.296673] PCI 0003:01:00.0#01fd: EEH: Invoking aacraid->resume()
[ 427.457355] aacraid 0003:01:00.0: aacraid: PCI error - resume
[ 427.457403] PCI 0003:01:00.0#01fd: EEH: aacraid driver reports: 'none'
[ 427.457404] EEH: Finished:'resume'
[ 427.457405] EEH: Recovery successful.
[ 427.457408] EEH: Recovering PHB#5-PE#fd
[ 427.457427] EEH: PE location: Onboard USB, PHB location: N/A
[ 427.457444] EEH: Frozen PHB#5-PE#fd detected
[ 427.457451] EEH: Call Trace:
[ 427.457461] EEH: [c00000000004cb6c] __eeh_send_failure_event+0x6c/0x120
[ 427.457479] EEH: [c000000000045824] eeh_dev_check_failure.part.0+0x294/0x5f0
[ 427.457499] EEH: [c000000000045d30] eeh_dev_check_write+0x140/0x1e0
[ 427.457517] EEH: [c00000000048cc68] __vfs_write+0x38/0x70
[ 427.457534] EEH: [c00000000048e00c] vfs_write+0x11c/0x2a0
[ 427.457551] EEH: [c000000000490c94] ksys_write+0x84/0x140
[ 427.457568] EEH: [c00000000000b378] system_call+0x5c/0x68
[ 427.457584] EEH: This PCI device has failed 1 times in the last hour and will be permanently disabled after 5000 failures.
[ 427.457584] EEH: Notify device drivers to shutdown
[ 427.457585] EEH: Beginning: 'error_detected(IO frozen)'
[ 427.457586] PCI 0005:01:00.0#00fd: EEH: driver not EEH aware
[ 427.457587] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'none'
[ 427.457589] EEH: Collect temporary log
[ 427.457609] EEH: of node=0005:01:00.0
[ 427.457611] EEH: PCI device/vendor: 8241104c
[ 427.457613] EEH: PCI cmd/status register: 00100546
[ 427.457613] EEH: PCI-E capabilities and status follow:
[ 427.457622] EEH: PCI-E 00: 0002c010 00008fc3 00002050 00075c12
[ 427.457628] EEH: PCI-E 10: 10120000 00000000 00000000 00000000
[ 427.457629] EEH: PCI-E 20: 00000000
[ 427.457630] EEH: PCI-E AER capability register set follows:
[ 427.457637] EEH: PCI-E AER 00: 15020001 00000000 00000000 00462030
[ 427.457644] EEH: PCI-E AER 10: 00000000 00002000 000001e0 00000000
[ 427.457650] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
[ 427.457652] EEH: PCI-E AER 30: 00000000 00000000
[ 427.457653] PHB4 PHB#5 Diag-data (Version: 1)
[ 427.457654] brdgCtl: 00000002
[ 427.457655] RootSts: 00000040 00402000 a0120008 00100107 00002000
[ 427.457656] PhbSts: 0000001c00000000 0000001c00000000
[ 427.457657] Lem: 0000000100280000 0000000000000000 0000000100000000
[ 427.457657] PhbErr: 0000088000000000 0000008000000000 2148000098000240 a008400000000000
[ 427.457658] RxeArbErr: 4000200000000000 0000200000000000 00009fde30000000 0000000000000000
[ 427.457659] PblErr: 0000000001000000 0000000001000000 0000000000000000 0000000000000000
[ 427.457660] RegbErr: 0040004000000000 0000004000000000 8800003c00000000 0000000007011000
[ 427.457662] PE[0fd] A/B: a440002a01000000 8000000000000000
[ 427.457663] EEH: Reset with hotplug activity
[ 427.457674] xhci_hcd 0005:01:00.0: remove, state 4
[ 427.457678] usb usb2: USB disconnect, device number 1
[ 427.458554] xhci_hcd 0005:01:00.0: USB bus 2 deregistered
[ 427.458563] xhci_hcd 0005:01:00.0: xHCI host controller not responding, assume dead
[ 427.458588] xhci_hcd 0005:01:00.0: remove, state 1
[ 427.458591] usb usb1: USB disconnect, device number 1
[ 427.458592] usb 1-3: USB disconnect, device number 2
[ 427.458594] usb 1-3.1: USB disconnect, device number 4
[ 427.586371] usb 1-4: USB disconnect, device number 3
[ 427.587217] xhci_hcd 0005:01:00.0: Host halt failed, -19
[ 427.587220] xhci_hcd 0005:01:00.0: Host not accessible, reset failed.
[ 427.587619] xhci_hcd 0005:01:00.0: USB bus 1 deregistered
[ 427.587663] pci 0005:01:00.0: Removing from iommu group 3
[ 432.509515] EEH: Sleep 5s ahead of complete hotplug
[ 437.745472] pci 0005:01:00.0: [104c:8241] type 00 class 0x0c0330
[ 437.745505] pci 0005:01:00.0: reg 0x10: [mem 0x600c280000000-0x600c28000ffff 64bit]
[ 437.745519] pci 0005:01:00.0: reg 0x18: [mem 0x600c280010000-0x600c280011fff 64bit]
[ 437.745558] pci 0005:01:00.0: BAR2 [mem size 0x00002000 64bit]: requesting alignment to 0x10000
[ 437.745626] pci 0005:01:00.0: supports D1 D2
[ 437.745628] pci 0005:01:00.0: PME# supported from D0 D1 D2 D3hot
[ 437.745774] pci 0005:01:00.0: can't claim BAR 2 [mem size 0x00002000 64bit]: no address assigned
[ 437.745782] pci 0005:01:00.0: BAR 0: assigned [mem 0x600c280000000-0x600c28000ffff 64bit]
[ 437.745794] pci 0005:01:00.0: BAR 2: assigned [mem 0x600c280010000-0x600c280011fff 64bit]
[ 437.745809] pci 0005:01 : [PE# fd] Secondary bus 0x0000000000000001 associated with PE#fd
[ 437.745891] pci 0005:01 : [PE# fd] Setting up 32-bit TCE table at 0..80000000
[ 437.749313] pci 0005:01 : [PE# fd] Setting up window#0 0..7ffffffffff pg=10000
[ 437.749320] pci 0005:01 : [PE# fd] Enabling 64-bit DMA bypass
[ 437.749325] pci 0005:00:00.0: PCI bridge to [bus 01]
[ 437.749331] pci 0005:00:00.0: bridge window [mem 0x600c280000000-0x600c2ffefffff]
[ 437.749423] pci 0005:01:00.0: enabling device (0140 -> 0142)
[ 437.750025] xhci_hcd 0005:01:00.0: xHCI Host Controller
[ 437.750036] xhci_hcd 0005:01:00.0: new USB bus registered, assigned bus number 1
[ 437.750185] xhci_hcd 0005:01:00.0: hcc params 0x0270f06d hci version 0x96 quirks 0x0000000004000000
[ 437.750705] xhci_hcd 0005:01:00.0: xHCI Host Controller
[ 437.750710] xhci_hcd 0005:01:00.0: new USB bus registered, assigned bus number 2
[ 437.750714] xhci_hcd 0005:01:00.0: Host supports USB 3.0 SuperSpeed
[ 437.750768] usb usb1: New USB device found, idVendor=1d6b, idProduct=0002, bcdDevice= 5.04
[ 437.750771] usb usb1: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[ 437.750773] usb usb1: Product: xHCI Host Controller
[ 437.750775] usb usb1: Manufacturer: Linux 5.4.0-156-generic xhci-hcd
[ 437.750776] usb usb1: SerialNumber: 0005:01:00.0
[ 437.751507] hub 1-0:1.0: USB hub found
[ 437.751522] hub 1-0:1.0: 4 ports detected
[ 437.751770] usb usb2: We don't know the algorithms for LPM for this host, disabling LPM.
[ 437.751805] usb usb2: New USB device found, idVendor=1d6b, idProduct=0003, bcdDevice= 5.04
[ 437.751807] usb usb2: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[ 437.751809] usb usb2: Product: xHCI Host Controller
[ 437.751811] usb usb2: Manufacturer: Linux 5.4.0-156-generic xhci-hcd
[ 437.751812] usb usb2: SerialNumber: 0005:01:00.0
[ 437.752312] hub 2-0:1.0: USB hub found
[ 437.752324] hub 2-0:1.0: 4 ports detected
[ 437.752586] EEH: Notify device driver to resume
[ 437.752588] EEH: Beginning: 'resume'
[ 437.752590] PCI 0005:01:00.0#00fd: EEH: driver not EEH aware
[ 437.752592] EEH: Finished:'resume'
[ 437.752594] EEH: Recovery successful.
[ 438.004898] usb 1-3: new high-speed USB device number 2 using xhci_hcd
[ 438.155653] usb 1-3: New USB device found, idVendor=0557, idProduct=7000, bcdDevice= 0.00
[ 438.155656] usb 1-3: New USB device strings: Mfr=0, Product=0, SerialNumber=0
[ 438.168473] hub 1-3:1.0: USB hub found
[ 438.169159] hub 1-3:1.0: 4 ports detected
[ 438.300909] usb 1-4: new high-speed USB device number 3 using xhci_hcd
[ 438.453070] usb 1-4: New USB device found, idVendor=0451, idProduct=80ff, bcdDevice= 1.01
[ 438.453073] usb 1-4: New USB device strings: Mfr=0, Product=0, SerialNumber=0
[ 438.533380] usb 1-3.1: new low-speed USB device number 4 using xhci_hcd
[ 438.651162] usb 1-3.1: New USB device found, idVendor=0557, idProduct=2419, bcdDevice= 1.00
[ 438.651165] usb 1-3.1: New USB device strings: Mfr=0, Product=0, SerialNumber=0
[ 438.679538] input: HID 0557:2419 as /devices/pci0005:00/0005:00:00.0/0005:01:00.0/usb1/1-3/1-3.1/1-3.1:1.0/0003:0557:2419.0003/input/input2
[ 438.738845] hid-generic 0003:0557:2419.0003: input,hidraw0: USB HID v1.00 Keyboard [HID 0557:2419] on usb-0005:01:00.0-3.1/input0
[ 438.744869] input: HID 0557:2419 as /devices/pci0005:00/0005:00:00.0/0005:01:00.0/usb1/1-3/1-3.1/1-3.1:1.1/0003:0557:2419.0004/input/input3
[ 438.746338] hid-generic 0003:0557:2419.0004: input,hidraw1: USB HID v1.00 Mouse [HID 0557:2419] on usb-0005:01:00.0-3.1/input1

When this happens, the system will be read-only and thus cause test interruption.

Revision history for this message
Po-Hsu Lin (cypressyew) wrote :
Revision history for this message
Po-Hsu Lin (cypressyew) wrote :

Here is the dmesg output with -155.

It seems that those EEH related failures can be found with -155 as well. But I am not sure why the test didn't fail.

I noticed that on the second attempt of 5.4.0-155-generic with 5.4.0-155-generic source, test has passed but the system still dropped into a read-only state.

Revision history for this message
Roxana Nicolescu (roxanan) wrote :

Commit "440e01539fb10b3562c82f98634c688f4adc9ef1": "PCI: Unify delay handling for reset and resume" breaks the test.
It adds a change where during reset, it actually waits after the secondary bus is ready. Before this change, it would wait after the bridge, which would always return immediately because the bridge is always accessible while its secondary bus is reset. This can be also seen in the logs.
Before the change, there is no waiting during reset and before resume:
EEH: Reset without hotplug activity
i40e 0002:01:00.0 enP2p1s0f0: NIC Link is Up, 1000 Mbps Full Duplex, Flow Control: None
aacraid 0003:01:00.0: enabling device (0140 -> 0142)
EEH: Beginning: 'slot_reset'
PCI 0003:01:00.0#01fd: EEH: Invoking aacraid->slot_reset()
aacraid 0003:01:00.0: aacraid: PCI error - slot reset
PCI 0003:01:00.0#01fd: EEH: aacraid driver reports: 'recovered'
EEH: Finished:'slot_reset' with aggregate recovery state:'recovered'
EEH: Notify device driver to resume
EEH: Beginning: 'resume'

After this change:
EH: Reset without hotplug activity
EXT4-fs warning (device sda2): ext4_end_bio:311: I/O error 10 writing to inode 68944133 (offset 0 size 0 starting block 14398)
i40e 0002:01:00.0 enP2p1s0f0: NIC Link is Up, 1000 Mbps Full Duplex, Flow Control: None
aacraid 0003:01:00.0: not ready 1023ms after bus reset; waiting
aacraid 0003:01:00.0: not ready 2047ms after bus reset; waiting
aacraid 0003:01:00.0: not ready 4095ms after bus reset; waiting
aacraid 0003:01:00.0: not ready 8191ms after bus reset; waiting
aacraid 0003:01:00.0: not ready 16383ms after bus reset; waiting
aacraid 0003:01:00.0: not ready 32767ms after bus reset; waiting
ast 0004:02:00.0: Going to break: [mem 0x600c200000000-0x600c200ffffff]
ast 0004:02:00.0: eeh_dev_check_failure(0004:02:00.0) = 0
xhci_hcd 0005:01:00.0: Going to break: [mem 0x600c280000000-0x600c28000ffff 64bit]
xhci_hcd 0005:01:00.0: eeh_dev_check_failure(0005:01:00.0) = 1
aacraid 0003:01:00.0: not ready 65535ms after bus reset; giving up
aacraid 0003:01:00.0: enabling device (0140 -> 0142)
EEH: Beginning: 'slot_reset'
PCI 0003:01:00.0#01fd: EEH: Invoking aacraid->slot_reset()
aacraid 0003:01:00.0: aacraid: PCI error - slot reset
PCI 0003:01:00.0#01fd: EEH: aacraid driver reports: 'recovered'
EEH: Finished:'slot_reset' with aggregate recovery state:'recovered'
EEH: Notify device driver to resume
EEH: Beginning: 'resume'

So what I found is that it just takes longer to reset and then resume the device, so increasing the timer to 120s in the test makes it pass.
BUT, I struggle to understand why it worked before. I would assume if waiting is not properly done, the device would not show was recovered after. Trying to understand that.

Note: I am not sure where this read-only problem comes from, I think I managed to reproduce it once but then I could not.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.