Bonus evidence for my theory that these timeouts are about the paths going down, not the disks.
If I unplug the adapter (= 2 paths), it goes through this:
1. It immediately detects that something is wrong; I/O might still be queued.
Jan 25 09:27:52 s1lp5 multipathd[782]: checker failed path 65:32 in map mpatha
Jan 25 09:27:52 s1lp5 multipathd[782]: mpatha: remaining active paths: 3
Jan 25 09:27:52 s1lp5 kernel: device-mapper: multipath: Failing path 65:32.
Jan 25 09:27:53 s1lp5 multipathd[782]: checker failed path 8:176 in map mpathc
Jan 25 09:27:53 s1lp5 multipathd[782]: mpathc: remaining active paths: 3
Jan 25 09:27:53 s1lp5 multipathd[782]: checker failed path 8:160 in map mpathd
Jan 25 09:27:53 s1lp5 multipathd[782]: mpathd: remaining active paths: 3
Jan 25 09:27:53 s1lp5 kernel: device-mapper: multipath: Failing path 8:176.
Jan 25 09:27:53 s1lp5 kernel: device-mapper: multipath: Failing path 8:160.
Jan 25 09:27:54 s1lp5 multipathd[782]: checker failed path 8:208 in map mpatha
Jan 25 09:27:54 s1lp5 multipathd[782]: mpatha: remaining active paths: 2
Jan 25 09:27:54 s1lp5 multipathd[782]: checker failed path 65:48 in map mpathe
Jan 25 09:27:54 s1lp5 multipathd[782]: mpathe: remaining active paths: 3
Jan 25 09:27:54 s1lp5 multipathd[782]: checker failed path 8:240 in map mpathd
Jan 25 09:27:54 s1lp5 multipathd[782]: mpathd: remaining active paths: 2
Jan 25 09:27:54 s1lp5 multipathd[782]: checker failed path 8:224 in map mpathe
Jan 25 09:27:54 s1lp5 multipathd[782]: mpathe: remaining active paths: 2
Jan 25 09:27:54 s1lp5 multipathd[782]: checker failed path 65:0 in map mpathc
Jan 25 09:27:54 s1lp5 multipathd[782]: mpathc: remaining active paths: 2
Jan 25 09:27:54 s1lp5 kernel: device-mapper: multipath: Failing path 8:208.
Jan 25 09:27:54 s1lp5 kernel: device-mapper: multipath: Failing path 65:48.
Jan 25 09:27:54 s1lp5 kernel: device-mapper: multipath: Failing path 8:240.
Jan 25 09:27:54 s1lp5 kernel: device-mapper: multipath: Failing path 8:224.
Jan 25 09:27:54 s1lp5 kernel: device-mapper: multipath: Failing path 65:0.
state now:
mpathb (36005076306ffd6b6000000000000240a) dm-3 IBM,2107900
size=10G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=50 status=active
  |- 0:0:0:1074413604 sdc 8:32  active ready       running
  |- 0:0:1:1074413604 sdh 8:112 active ready       running
  |- 1:0:1:1074413604 sdr 65:16 active i/o pending running
  `- 1:0:0:1074413604 sdm 8:192 active i/o pending running
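As an aside, both timeouts involved here can be read per FC remote port from sysfs. A minimal sketch (the rport names are whatever your system presents; the function takes the sysfs root as a parameter only so it can be exercised against a scratch directory):

```shell
# Print fast_io_fail_tmo and dev_loss_tmo for each FC remote port.
# The optional argument overrides the sysfs root (useful for testing).
show_fc_timeouts() {
    root=${1:-/sys/class/fc_remote_ports}
    for rport in "$root"/rport-*; do
        [ -e "$rport/dev_loss_tmo" ] || continue  # no FC rports here
        printf '%s: fast_io_fail_tmo=%s dev_loss_tmo=%s\n' \
            "${rport##*/}" \
            "$(cat "$rport/fast_io_fail_tmo")" \
            "$(cat "$rport/dev_loss_tmo")"
    done
}

show_fc_timeouts
```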
2. After the timeout for failing I/O (fast_io_fail_tmo) expires, all queued I/O on those paths is cancelled.
state now:
mpathb (36005076306ffd6b6000000000000240a) dm-3 IBM,2107900
size=10G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=50 status=active
  |- 0:0:0:1074413604 sdc 8:32  active ready  running
  |- 0:0:1:1074413604 sdh 8:112 active ready  running
  |- 1:0:1:1074413604 sdr 65:16 failed faulty running
  `- 1:0:0:1074413604 sdm 8:192 failed faulty running
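For reference, these timeouts are normally driven from multipath.conf rather than set by hand; a fragment matching the behaviour described here might look like this (the values are illustrative, not taken from this system's config):

```
defaults {
    fast_io_fail_tmo  5
    dev_loss_tmo      60
}
```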
3. Finally, once my 60-second limit on the paths (dev_loss_tmo) is hit, they are considered dead and removed.
Jan 25 09:28:52 s1lp5 kernel: rport-1:0-0: blocked FC remote port time out: removing target and saving binding
Jan 25 09:28:52 s1lp5 kernel: rport-1:0-1: blocked FC remote port time out: removing target and saving binding
Jan 25 09:28:52 s1lp5 kernel: sd 1:0:0:1073889316: [sdk] Synchronizing SCSI cache
Jan 25 09:28:52 s1lp5 kernel: sd 1:0:0:1073889316: [sdk] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jan 25 09:28:52 s1lp5 kernel: sd 1:0:0:1073954852: [sdl] Synchronizing SCSI cache
Jan 25 09:28:52 s1lp5 kernel: sd 1:0:0:1073954852: [sdl] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jan 25 09:28:52 s1lp5 multipathd[782]: mpathd: load table [0 20971520 multipath 1 queue_if_no_path 1 alua 1 1 service-time 0 3 1 8:0 1 8:80 1 8:240 1]
Jan 25 09:28:52 s1lp5 multipathd[782]: sync_map_state: failing sdp state 2 dmstate 2
Jan 25 09:28:52 s1lp5 multipathd[782]: sdk [8:160]: path removed from map mpathd
Jan 25 09:28:52 s1lp5 kernel: device-mapper: multipath: Failing path 8:240.
Jan 25 09:28:52 s1lp5 kernel: sd 1:0:1:1073889316: [sdp] Synchronizing SCSI cache
Jan 25 09:28:52 s1lp5 kernel: sd 1:0:1:1073889316: [sdp] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jan 25 09:28:52 s1lp5 kernel: scsi 1:0:0:1073889316: alua: Detached
Jan 25 09:28:52 s1lp5 multipath[994777]: 8:240: cannot find block device
Jan 25 09:28:52 s1lp5 multipath[994777]: 8:240: Empty device name
Jan 25 09:28:52 s1lp5 multipath[994777]: 8:240: Empty device name
...
state now:
mpathb (36005076306ffd6b6000000000000240a) dm-3 IBM,2107900
size=10G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=50 status=active
  |- 0:0:0:1074413604 sdc 8:32  active ready running
  `- 0:0:1:1074413604 sdh 8:112 active ready running
^^ Nothing will be enqueued or checked on those paths anymore; they are gone.
IIUC this is exactly what dev_loss_tmo is for.
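dev_loss_tmo can also be poked directly per rport in sysfs. A sketch (on a real system this needs root, and multipathd may reconcile the value back from its dev_loss_tmo config option; the sysfs-root parameter exists only so the function can be tried against a scratch directory):

```shell
# Set dev_loss_tmo (in seconds) on every FC remote port under the
# given sysfs root (default: the real FC transport class directory).
set_dev_loss_tmo() {
    secs=$1
    root=${2:-/sys/class/fc_remote_ports}
    for rport in "$root"/rport-*; do
        [ -e "$rport/dev_loss_tmo" ] || continue
        echo "$secs" > "$rport/dev_loss_tmo"
    done
}
```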
4. If I later bring the paths back, they are newly detected and re-added.
Jan 25 09:30:24 s1lp5 kernel: scsi 1:0:0:1076052004: Direct-Access IBM 2107900 2700 PQ: 0 ANSI: 5
Jan 25 09:30:24 s1lp5 kernel: scsi 1:0:0:1076052004: alua: supports implicit TPGS
Jan 25 09:30:24 s1lp5 kernel: scsi 1:0:0:1076052004: alua: device naa.6005076306ffd6b60000000000002423 port group 0 rel port 330
Jan 25 09:30:24 s1lp5 kernel: sd 1:0:0:1076052004: Attached scsi generic sg18 type 0
Jan 25 09:30:24 s1lp5 kernel: sd 1:0:0:1076052004: Power-on or device reset occurred
Jan 25 09:30:24 s1lp5 kernel: scsi 1:0:0:1076117540: Direct-Access IBM 2107900 2700 PQ: 0 ANSI: 5
Jan 25 09:30:24 s1lp5 kernel: scsi 1:0:0:1076117540: alua: supports implicit TPGS
Jan 25 09:30:24 s1lp5 kernel: scsi 1:0:0:1076117540: alua: device naa.6005076306ffd6b60000000000002424 port group 0 rel port 330
Jan 25 09:30:24 s1lp5 kernel: sd 1:0:0:1076117540: Attached scsi generic sg19 type 0
Jan 25 09:30:24 s1lp5 kernel: sd 1:0:0:1076117540: Power-on or device reset occurred
Jan 25 09:30:24 s1lp5 kernel: sd 1:0:1:1074413604: alua: port group 00 state A preferred supports tolusnA
Jan 25 09:30:24 s1lp5 kernel: sd 1:0:1:1076117540: alua: port group 00 state A preferred supports tolusnA
Jan 25 09:30:24 s1lp5 kernel: sd 1:0:1:1073954852: alua: port group 00 state A preferred supports tolusnA
Jan 25 09:30:24 s1lp5 kernel: sd 1:0:1:1074413604: [sdm] 20971520 512-byte logical blocks: (10.7 GB/10.0 GiB)
Jan 25 09:30:24 s1lp5 kernel: sd 1:0:1:1073954852: [sdl] 20971520 512-byte logical blocks: (10.7 GB/10.0 GiB)
Jan 25 09:30:24 s1lp5 kernel: sd 1:0:1:1076117540: [sdo] 134217728 512-byte logical blocks: (68.7 GB/64.0 GiB)
Jan 25 09:30:24 s1lp5 kernel: sd 1:0:1:1074413604: [sdm] Write Protect is off
Jan 25 09:30:24 s1lp5 kernel: sd 1:0:1:1074413604: [sdm] Mode Sense: ed 00 00 08
Jan 25 09:30:24 s1lp5 kernel: sd 1:0:1:1073954852: [sdl] Write Protect is off
Jan 25 09:30:24 s1lp5 kernel: sd 1:0:1:1076117540: [sdo] Write Protect is off
Jan 25 09:30:24 s1lp5 kernel: sd 1:0:1:1073954852: [sdl] Mode Sense: ed 00 00 08
Jan 25 09:30:24 s1lp5 kernel: sd 1:0:1:1076117540: [sdo] Mode Sense: ed 00 00 08
Jan 25 09:30:24 s1lp5 kernel: sd 1:0:1:1074413604: [sdm] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Jan 25 09:30:24 s1lp5 kernel: sd 1:0:1:1073954852: [sdl] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Jan 25 09:30:24 s1lp5 kernel: sd 1:0:1:1076117540: [sdo] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Jan 25 09:30:24 s1lp5 kernel: sd 1:0:1:1076052004: alua: port group 00 state A preferred supports tolusnA
Jan 25 09:30:24 s1lp5 kernel: sd 1:0:1:1076052004: [sdn] 134217728 512-byte logical blocks: (68.7 GB/64.0 GiB)
Jan 25 09:30:24 s1lp5 kernel: sd 1:0:1:1076052004: [sdn] Write Protect is off
Jan 25 09:30:24 s1lp5 kernel: sd 1:0:1:1076052004: [sdn] Mode Sense: ed 00 00 08
state now:
mpathb (36005076306ffd6b6000000000000240a) dm-3 IBM,2107900
size=10G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=50 status=active
  |- 0:0:0:1074413604 sdc 8:32  active ready running
  |- 0:0:1:1074413604 sdh 8:112 active ready running
  |- 1:0:1:1074413604 sdm 8:192 active ready running
  `- 1:0:0:1074413604 sdr 65:16 active ready running
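To confirm a recovery like this programmatically, one could count the paths reported as "active ready" in `multipath -ll` output. A sketch with the sample inlined (in practice you would pipe the real command's output in instead):

```shell
# Count paths in "active ready" state from multipath -ll output.
sample='|- 0:0:0:1074413604 sdc 8:32  active ready running
|- 0:0:1:1074413604 sdh 8:112 active ready running
|- 1:0:1:1074413604 sdm 8:192 active ready running
`- 1:0:0:1074413604 sdr 65:16 active ready running'
ready=$(printf '%s\n' "$sample" | grep -c 'active ready')
printf 'ready paths: %s\n' "$ready"
```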